Muown: Row-Norm Control for Muon Optimization

Antonio Orvieto; Bingcong Li; Florian H\"ubler; Kai Lion; Niao He

arxiv: 2605.10797 · v1 · submitted 2026-05-11 · 💻 cs.LG

Muown: Row-Norm Control for Muon Optimization

Kai Lion , Florian H\"ubler , Bingcong Li , Antonio Orvieto , Niao He This is my paper

Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords Muon optimizerrow-norm controlspectral norm decompositionnon-convex optimizationlanguage model pre-trainingweight decaystochastic gradient descent

0 comments

The pith

Muown treats row-magnitude vectors as explicit optimizer variables under infinity-norm geometry to stabilize Muon and eliminate spectral-norm drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Muon's upward spectral-norm drift arises mainly from growth in row magnitudes rather than changes in row coherence. Muown addresses this by maintaining the row-magnitude vector as a separate state that receives its own updates under the induced infinity-norm geometry, while the direction component continues to receive standard Muon steps. This decomposition yields a proof of optimal non-convex convergence rates in both deterministic and stochastic settings, with an empirically lower noise coefficient than Muon. On GPT-style pre-training from 124M to 2.7B parameters, the method improves perplexity, widens the range of effective learning rates, and lowers sensitivity to weight decay at negligible overhead.

Core claim

Through a spectral-norm decomposition into row-magnitude and row-coherence factors, Muown promotes the row-magnitude vector to an explicit optimizer variable updated under the induced infinity-norm geometry while applying unmodified Muon to the remaining direction component, attaining optimal non-convex rates under aligned dual norms and a stochastic noise coefficient that stays below Muon's throughout training.

What carries the argument

The row-magnitude vector promoted to an explicit optimizer state and updated under the infinity-norm geometry induced by the row-magnitude/row-coherence decomposition of the spectral norm.

If this is right

Muown attains the optimal non-convex convergence rates in both deterministic and stochastic regimes under dual norms aligned with the row-magnitude and direction geometries.
The stochastic noise coefficient of Muown remains below that of Muon across training.
Muown improves perplexity over Muon, SOAP, AdamW, and Lion on FineWeb-Edu pre-training for models from 124M to 2.7B parameters.
Muown widens the plateau of near-optimal learning rates and reduces sensitivity to weight decay while avoiding spectral-norm drift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same explicit-magnitude treatment could be grafted onto other matrix-based optimizers to control analogous drift phenomena without altering their core update rules.
Because the row-magnitude updates are cheap and shardable, the method scales to models far beyond 2.7B with little added cost.
The decomposition suggests that other matrix factorizations (for example, column norms or Frobenius components) might similarly be promoted to control different forms of optimizer instability.

Load-bearing premise

The row-magnitude factor identified by the spectral-norm decomposition is the main empirical driver of the upward drift observed under Muon.

What would settle it

Run standard Muon while externally clamping or rescaling row magnitudes at each step; if the spectral-norm drift disappears and perplexity matches Muown, the claim that explicit row-magnitude control is required would be falsified.

Figures

Figures reproduced from arXiv: 2605.10797 by Antonio Orvieto, Bingcong Li, Florian H\"ubler, Kai Lion, Niao He.

**Figure 1.** Figure 1: Left two: Drift of the maximum row norm ∥gt∥∞ drives the growth of the spectral norm ∥Wt∥S∞ for an MLP and attention output projection layer in a 500M transformer. When intervening by changing the parameterization s.t. the row magnitudes are fixed to the value at initialization, the systematic spectral norm increase vanishes. Right two: The remaining coherence factor λmax Pt Ct Pt in the spectral norm ex… view at source ↗

**Figure 3.** Figure 3: Perplexity as a function of learning rate for Muon, Muown, Lion, and SOAP on a 124M [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 2.** Figure 2: Abstract Illustration of Muown and Muon, showing the parameterization (top), semantics and geometry (middle), and the optimizer step (bottom). Operator Norm Perspective. This viewpoint is formalized by the operator-norm framework of metrized deep learning (Large et al., 2024; Bernstein and Newhouse, 2025), where each layer is equipped with a geometry derived from its input-output semantics, and gradients… view at source ↗

**Figure 4.** Figure 4: Left: Perplexity of Muon and Muown on a 2.7B transformer trained on FineWeb-Edu. Right: Ablation of weight decay and learning rate for Muon and Muown on 124M models after 5B tokens. orthogonality prior of reparameterizations and without the indiscriminate shrinkage of weight decay. 3.2 Muown To turn ∥gt∥∞ into a controllable variable, we reparameterize each linear layer as W(g, R) = Diag(g/ ∥R∥row)R with g… view at source ↗

**Figure 5.** Figure 5: Validation loss of Muon and Muown at three different model widths of 512, 768, 1280 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Normalized effective rank (Roy and Vetterli, 2007) of minibatch gradients for a 124M model. At a given training step, we sample 8 gradients and average their effective ranks. At 500M ( [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of the noise coefficients for Muon and Muown. Shaded areas mark the interquartile range over weights. Relative to Muon, Muown adds only elementwise O(mn) work per layer (the weight-norm decomposition and recomposition together with the Adam update on g ∈ R m) asymptotically dominated by the shared Newton-Schulz iteration at O(m2n) for m ≤ n. We benchmark two variants: a replicated baseline th… view at source ↗

**Figure 8.** Figure 8: Learning rate and width ablation for Muon and Muown using the [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of NorMuon and Muown on a 124M model. Both NorMuon (Li et al., 2025) and Muown augment plain Muon with a per-row mechanism on the rows of the 2D weight matrices, but they intervene at different points in the update. Plain Muon orthogonalizes the momentum of ∇WL via Newton–Schulz and applies it directly to the parameter without a rowwise state. NorMuon instead applies a postprocessing step t… view at source ↗

**Figure 10.** Figure 10: ∥Wt∥S∞ for the layers in the 4th, 8th, and 12th transformer block of a 500M model. 2.5 5.0 7.5 10.0 12.5 15.0 Block 4 Max row norm Q 2.5 5.0 7.5 10.0 12.5 15.0 K 2 4 6 8 10 12 V 0 5 10 15 Out 0 5 10 15 FC1 0 10 20 30 40 50 60 FC2 2 4 6 8 10 12 Block 8 Max row norm 2 4 6 8 10 12 2 4 6 8 10 12 0.0 2.5 5.0 7.5 10.0 12.5 15.0 2.5 5.0 7.5 10.0 12.5 15.0 0 10 20 30 40 50 0 5000 10000 15000 20000 Step 2 4 6 8 10… view at source ↗

**Figure 11.** Figure 11: ∥gt∥∞ for the layers in the 4th, 8th, and 12th transformer block of a 500M model. both methods on a 124M model across eleven learning rates spanning 2.5 × 10−4 to 4 × 10−2 . Muown matches or improves on NorMuon at most learning rates that lie near the optimum and exhibits a noticeably wider basin of stability. A.4 Spectral Analysis We provide more detailed results for the empirical association between ∥Wt… view at source ↗

**Figure 12.** Figure 12: p λmax(Pt Ct Pt) for the layers in the 4th, 8th, and 12th transformer block of a 500M model. 0 2500 5000 7500 10000 12500 15000 17500 20000 Step 0 10 20 30 40 50 Value Spectral Norm kWkS1 Max Row Norm kgk1 Muown, Fixed g Muown, ¸ = 0:0 Muon, ¸ = 0:1 Muon, ¸ = 0:0 (a) MLP Layer 1 0 2500 5000 7500 10000 12500 15000 17500 20000 Step 0 5 10 15 20 25 Value Spectral Norm kWkS1 Max Row Norm kgk1 Muown, Fixed g M… view at source ↗

**Figure 13.** Figure 13: Evolution of spectral norm ∥Wt∥S∞ and maximum row norm ∥gt∥∞ over the course of training for linear layers of the 16th transformer block of a 500M model. empirically, suppressing row-norm growth alone already accounts for a significant part of the improvement over Muon: Fixed, which freezes row magnitudes at initialization, outperforms Muon at the smaller learning rates and prevents the divergence at η = … view at source ↗

**Figure 14.** Figure 14: Histogram of singular values for linear layers of the 16th transformer block of a 500M [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 15.** Figure 15: Singular value distribution for linear layers of the 16th transformer block of a 500M [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗

**Figure 16.** Figure 16: Normalized effective rank (Roy and Vetterli, 2007) of minibatch gradients for a 124M model. At a given training step, we sample 8 gradients and average their effective ranks. Algorithm 2 Muown (SignSGD Version) Require: Initial point W1 ∈ R m×n with nonzero rows; stepsizes η, γ > 0; momentum β1 ∈ [0, 1) Initialize R1 ← W1, g1 ← ∥W1∥row ,M0 ← ∇RL(W1, ξ1), and m0 ← ∇gL(W1, ξ1). for t = 1, 2, . . . do # Upda… view at source ↗

read the original abstract

Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the $\ell_\infty$ geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic regimes under a dual norm aligned with the underlying geometries and with a stochastic noise coefficient that empirically remains below that of Muon throughout training. Across GPT-style pre-training on FineWeb-Edu with model sizes from 124M up to 2.7B parameters, Muown improves perplexity over Muon, SOAP, AdamW, and Lion. It also widens the plateau of near-optimal learning rates across model scales, reduces sensitivity to weight decay, and avoids the spectral norm drift at negligible step-time overhead when appropriately sharded.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Muown gives Muon a row-magnitude control knob that cuts spectral drift and improves LLM pretraining stability, but the stochastic convergence rate still depends on an empirical noise bound rather than a closed proof.

read the letter

The core contribution is a targeted fix for Muon's spectral norm drift. The authors decompose the norm into a row-magnitude factor and a row-coherence factor, show that the magnitude term drives the upward drift on their trajectories, and then treat the magnitude vector as a separate variable updated under ℓ_∞ geometry while leaving the direction component under standard Muon. That split is new and produces a clean drop-in replacement they call Muown.

Referee Report

2 major / 2 minor

Summary. The paper diagnoses upward spectral-norm drift in Muon (without decoupled weight decay) via a decomposition into row-magnitude and row-coherence factors, identifies row-magnitude as the empirical driver, and introduces Muown: an explicit ℓ_∞ update on the row-magnitude vector while retaining Muon on the orthogonal direction component. It claims optimal non-convex convergence rates in both deterministic and stochastic regimes under a dual norm aligned with the geometry, with the stochastic noise coefficient shown only empirically to remain below Muon’s. Experiments on GPT-style pre-training (124M–2.7B parameters on FineWeb-Edu) report improved perplexity, wider near-optimal learning-rate plateaus, reduced weight-decay sensitivity, and elimination of the drift at negligible overhead.

Significance. If the decomposition is robust and the convergence claims can be made fully rigorous, Muown would provide a principled, low-overhead mechanism for norm control that addresses a concrete failure mode of Muon at scale. The explicit separation of magnitude and direction, together with the dual-norm alignment, offers a clean geometric insight that could generalize to other matrix optimizers. The reported empirical gains across model scales are practically relevant for language-model training, but the stochastic-rate guarantee currently rests on an empirical noise-coefficient comparison rather than a closed theoretical bound.

major comments (2)

[Abstract / Stochastic Convergence Theorem] Abstract and convergence analysis (stochastic regime): the claim that Muown attains the optimal non-convex rate 'with a stochastic noise coefficient that empirically remains below that of Muon' makes the rate guarantee conditional on an observed trajectory property rather than a proven bound. Because the noise coefficient is verified only on the reported FineWeb-Edu runs, the analysis does not independently establish the rate; any hidden dependence on Muon’s fitted behavior undermines the 'attains … rates' statement.
[Decomposition and Diagnosis] Spectral-norm decomposition (motivation section): the identification of row-magnitude as the 'empirical driver' of drift is presented as an observation along the Muon trajectory. To justify treating row-magnitude as the primary variable to control, the paper should quantify the relative contribution of row-coherence (e.g., via an ablation or bound showing its variation is negligible) rather than relying on the qualitative statement that it 'remains well-behaved'.

minor comments (2)

[Experiments] The abstract states that Muown 'avoids the spectral norm drift at negligible step-time overhead when appropriately sharded'; the sharding strategy and measured overhead should be detailed with a table or figure so readers can reproduce the claim.
[Introduction / Notation] Notation for the row-magnitude vector and the induced ℓ_∞ geometry is introduced without an explicit equation reference in the abstract; a short displayed equation defining the decomposition (row-magnitude × row-coherence) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, indicating the revisions we will incorporate to strengthen the presentation and analysis.

read point-by-point responses

Referee: [Abstract / Stochastic Convergence Theorem] Abstract and convergence analysis (stochastic regime): the claim that Muown attains the optimal non-convex rate 'with a stochastic noise coefficient that empirically remains below that of Muon' makes the rate guarantee conditional on an observed trajectory property rather than a proven bound. Because the noise coefficient is verified only on the reported FineWeb-Edu runs, the analysis does not independently establish the rate; any hidden dependence on Muon’s fitted behavior undermines the 'attains … rates' statement.

Authors: We agree that the wording in the abstract and theorem statement could be read as claiming an unconditional optimal rate. The stochastic theorem establishes the standard optimal non-convex rate whenever the noise coefficient is bounded by a constant; the manuscript then reports that this coefficient is observed to be smaller than Muon’s on the FineWeb-Edu trajectories. We will revise the abstract and the statement of the stochastic theorem to read that Muown attains the optimal rate with a noise coefficient that is empirically smaller than Muon’s throughout the reported training runs. We will also add a short paragraph in the convergence section noting that the comparison is empirical and dataset-specific, while the theorem itself holds for any optimizer satisfying the bound. This clarifies the scope without altering the underlying analysis. revision: yes
Referee: [Decomposition and Diagnosis] Spectral-norm decomposition (motivation section): the identification of row-magnitude as the 'empirical driver' of drift is presented as an observation along the Muon trajectory. To justify treating row-magnitude as the primary variable to control, the paper should quantify the relative contribution of row-coherence (e.g., via an ablation or bound showing its variation is negligible) rather than relying on the qualitative statement that it 'remains well-behaved'.

Authors: The referee correctly notes that the claim rests on an empirical observation. To make the motivation more rigorous, we will add quantitative support in the revised motivation section: (i) plots of the relative change in the row-magnitude factor versus the row-coherence factor over the course of training for multiple model sizes, and (ii) a simple variance comparison or ratio bound demonstrating that the coherence factor exhibits substantially smaller variation than the magnitude factor. These additions will be placed immediately after the decomposition and will justify the decision to control the row-magnitude vector explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity in claimed derivation or proof

full rationale

The paper motivates Muown from an empirical observation of spectral-norm drift via row-magnitude decomposition, then introduces an explicit update rule for that component under ℓ_∞ geometry. The central claim is a mathematical proof that the resulting algorithm attains standard optimal non-convex rates (deterministic and stochastic) under a dual norm aligned with the induced geometries. The stochastic-noise-coefficient comparison is presented only as an empirical verification on the reported runs, not as an input that is fitted and then renamed as a prediction. No equation or step in the provided abstract reduces the rate result to a self-definition, a fitted parameter, or a self-citation chain; the analysis is offered as independent of the motivating empirical diagnosis. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the row-magnitude versus row-coherence decomposition as the driver of spectral drift and on the dual-norm alignment assumptions required for the stated optimal non-convex rates.

axioms (2)

domain assumption Spectral norm decomposes into row-magnitude factor and row-coherence factor, with the former driving the upward drift under Muon.
This decomposition is presented as the empirical diagnosis motivating the method.
domain assumption The dual norm is aligned with the underlying row-magnitude and direction geometries for the convergence analysis.
Invoked to establish the optimal non-convex rates in deterministic and stochastic settings.

pith-pipeline@v0.9.0 · 5561 in / 1623 out tokens · 76219 ms · 2026-05-12T03:49:43.029803+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor... updating it under the ℓ_∞ geometry... dual norm aligned with the underlying geometries
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and initial Peano algebra unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic regimes

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

[1]

Ahn, Kwangjun and Xu, Byron and Abreu, Natalie and Langford, John , year = 2025, note =. Dion:

work page 2025
[2]

Niccolò Ajroldi , year = 2024, howpublished =

work page 2024
[3]

Balles, Lukas and Pedregosa, Fabian and Roux, Nicolas Le , year = 2020, note =. The

work page 2020
[4]

Beck, Amir and Teboulle, Marc , year = 2003, journal =. Mirror

work page 2003
[5]

Deriving

Bernstein, Jeremy , year = 2025, url =. Deriving

work page 2025
[6]

Bernstein, Jeremy and Newhouse, Laker , year = 2024, booktitle =. Old

work page 2024
[7]

Pythia: a suite for analyzing large language models across training and scaling , author =

work page
[8]

Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , year = 2022, booktitle =

work page 2022
[9]

Optimization

Bottou, L. Optimization. SIAM review , publisher =

work page
[10]

Convex Optimization , author =

work page
[11]

Chen, Lizhang and Li, Jonathan and Liu, Qiang , year = 2025, booktitle =. Muon

work page 2025
[12]

Shen, Wei and Huang, Ruichuan and Huang, Minhui and Shen, Cong and Zhang, Jiawei , year = 2025, note =. On the

work page 2025
[13]

D'Angelo, Francesco and Andriushchenko, Maksym and Varre, Aditya and Flammarion, Nicolas , year = 2024, booktitle =. Why

work page 2024
[14]

Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , year = 2022, booktitle =

work page 2022
[15]

Muon: An

Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , year = 2024, url =. Muon: An

work page 2024
[16]

Karpathy, Andrej , year = 2022, url =

work page 2022
[17]

Rotational

Kosson, Atli and Messmer, Bettina and Jaggi, Martin , year = 2024, month = jun, urldate =. Rotational

work page 2024
[18]

Li, Zichong and Liu, Liming and Liang, Chen and Chen, Weizhu and Zhao, Tuo , year = 2025, note =. Nor

work page 2025
[19]

Liu, Jingyuan and Su, Jianlin and Yao, Xingcheng and Jiang, Zhejun and Lai, Guokun and Du, Yulun and Qin, Yidao and Xu, Weixin and Lu, Enzhe and Yan, Junjie and Chen, Yanru and Zheng, Huabin and Liu, Yibo and Liu, Shaowei and Yin, Bohong and He, Weiran and Zhu, Han and Wang, Yuzhi and Wang, Jianzhou and Dong, Mengnan and Zhang, Zheng and Kang, Yongsheng a...

work page 2025
[20]

and Foster, Dylan J

Arjevani, Yossi and Carmon, Yair and Duchi, John C. and Foster, Dylan J. and Srebro, Nathan and Woodworth, Blake , year = 2023, journal =. Lower

work page 2023
[21]

doi:10.57967/hf/2497 , publisher =

Lozhkov, Anton and Ben Allal, Loubna and von Werra, Leandro and Wolf, Thomas , year = 2024, publisher =. doi:10.57967/hf/2497 , url =

work page doi:10.57967/hf/2497 2024
[22]

Nesterov,. A. Doklady Akademii Nauk SSSR , volume = 269, number = 3, pages =

work page
[23]

Spectral

Miyato, Takeru and Kataoka, Toshiki and Koyama, Masanori and Yoshida, Yuichi , year = 2018, note =. Spectral

work page 2018
[24]

Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and @fernbear.bsky.social and Boza Vlado and You Jiacheng and Franz Cesista and Braden Koszarsky and @Grad62304977 , year = 2024, url =

work page 2024
[25]

Problem Complexity and Method Efficiency in Optimization , author =

work page
[26]

Preston and Cesista, Franz and Zahorodnii, Andrii and Bernstein, Jeremy and Isola, Phillip , year = 2025, note =

Newhouse, Laker and Hess, R. Preston and Cesista, Franz and Zahorodnii, Andrii and Bernstein, Jeremy and Isola, Phillip , year = 2025, note =. Training

work page 2025
[27]

Olivier Roy and Martin Vetterli , year = 2007, booktitle =

work page 2007
[28]

Lions and

Sfyraki, Maria-Eleni and Wang, Jun-Kun , year = 2025, note =. Lions and

work page 2025
[29]

Shazeer, Noam , year = 2020, note =

work page 2020
[30]

Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng , year = 2024, journal =

work page 2024
[31]

Kimi-Team and Bai, Yifan and Bao, Yiping and Chen, Guanduo and Chen, Jiahao and Chen, Ningxin and Chen, Ruijue and Chen et al., Yanru , year = 2025, note =. Kimi

work page 2025
[32]

, year = 2004, type =

Tropp, Joel A. , year = 2004, type =. Topics in

work page 2004
[33]

Orthogonalising gradients to speed up neural network optimisation , author =

work page
[34]

Implicit

Xie, Shuo and Li, Zhiyuan , year = 2024, note =. Implicit

work page 2024
[35]

An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jianxin Yang and Jin Xu and Jingren Zhou and Jinze Bai and Jinzh...

work page 2024
[36]

and Bernstein, Jeremy , year = 2024, note =

Yang, Greg and Simon, James B. and Bernstein, Jeremy , year = 2024, note =. A

work page 2024
[37]

Spectral

Yoshida, Yuichi and Miyato, Takeru , year = 2017, note =. Spectral

work page 2017

[1] [1]

Ahn, Kwangjun and Xu, Byron and Abreu, Natalie and Langford, John , year = 2025, note =. Dion:

work page 2025

[2] [2]

Niccolò Ajroldi , year = 2024, howpublished =

work page 2024

[3] [3]

Balles, Lukas and Pedregosa, Fabian and Roux, Nicolas Le , year = 2020, note =. The

work page 2020

[4] [4]

Beck, Amir and Teboulle, Marc , year = 2003, journal =. Mirror

work page 2003

[5] [5]

Deriving

Bernstein, Jeremy , year = 2025, url =. Deriving

work page 2025

[6] [6]

Bernstein, Jeremy and Newhouse, Laker , year = 2024, booktitle =. Old

work page 2024

[7] [7]

Pythia: a suite for analyzing large language models across training and scaling , author =

work page

[8] [8]

Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , year = 2022, booktitle =

work page 2022

[9] [9]

Optimization

Bottou, L. Optimization. SIAM review , publisher =

work page

[10] [10]

Convex Optimization , author =

work page

[11] [11]

Chen, Lizhang and Li, Jonathan and Liu, Qiang , year = 2025, booktitle =. Muon

work page 2025

[12] [12]

Shen, Wei and Huang, Ruichuan and Huang, Minhui and Shen, Cong and Zhang, Jiawei , year = 2025, note =. On the

work page 2025

[13] [13]

D'Angelo, Francesco and Andriushchenko, Maksym and Varre, Aditya and Flammarion, Nicolas , year = 2024, booktitle =. Why

work page 2024

[14] [14]

Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , year = 2022, booktitle =

work page 2022

[15] [15]

Muon: An

Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , year = 2024, url =. Muon: An

work page 2024

[16] [16]

Karpathy, Andrej , year = 2022, url =

work page 2022

[17] [17]

Rotational

Kosson, Atli and Messmer, Bettina and Jaggi, Martin , year = 2024, month = jun, urldate =. Rotational

work page 2024

[18] [18]

Li, Zichong and Liu, Liming and Liang, Chen and Chen, Weizhu and Zhao, Tuo , year = 2025, note =. Nor

work page 2025

[19] [19]

Liu, Jingyuan and Su, Jianlin and Yao, Xingcheng and Jiang, Zhejun and Lai, Guokun and Du, Yulun and Qin, Yidao and Xu, Weixin and Lu, Enzhe and Yan, Junjie and Chen, Yanru and Zheng, Huabin and Liu, Yibo and Liu, Shaowei and Yin, Bohong and He, Weiran and Zhu, Han and Wang, Yuzhi and Wang, Jianzhou and Dong, Mengnan and Zhang, Zheng and Kang, Yongsheng a...

work page 2025

[20] [20]

and Foster, Dylan J

Arjevani, Yossi and Carmon, Yair and Duchi, John C. and Foster, Dylan J. and Srebro, Nathan and Woodworth, Blake , year = 2023, journal =. Lower

work page 2023

[21] [21]

doi:10.57967/hf/2497 , publisher =

Lozhkov, Anton and Ben Allal, Loubna and von Werra, Leandro and Wolf, Thomas , year = 2024, publisher =. doi:10.57967/hf/2497 , url =

work page doi:10.57967/hf/2497 2024

[22] [22]

Nesterov,. A. Doklady Akademii Nauk SSSR , volume = 269, number = 3, pages =

work page

[23] [23]

Spectral

Miyato, Takeru and Kataoka, Toshiki and Koyama, Masanori and Yoshida, Yuichi , year = 2018, note =. Spectral

work page 2018

[24] [24]

Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and @fernbear.bsky.social and Boza Vlado and You Jiacheng and Franz Cesista and Braden Koszarsky and @Grad62304977 , year = 2024, url =

work page 2024

[25] [25]

Problem Complexity and Method Efficiency in Optimization , author =

work page

[26] [26]

Preston and Cesista, Franz and Zahorodnii, Andrii and Bernstein, Jeremy and Isola, Phillip , year = 2025, note =

Newhouse, Laker and Hess, R. Preston and Cesista, Franz and Zahorodnii, Andrii and Bernstein, Jeremy and Isola, Phillip , year = 2025, note =. Training

work page 2025

[27] [27]

Olivier Roy and Martin Vetterli , year = 2007, booktitle =

work page 2007

[28] [28]

Lions and

Sfyraki, Maria-Eleni and Wang, Jun-Kun , year = 2025, note =. Lions and

work page 2025

[29] [29]

Shazeer, Noam , year = 2020, note =

work page 2020

[30] [30]

Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng , year = 2024, journal =

work page 2024

[31] [31]

Kimi-Team and Bai, Yifan and Bao, Yiping and Chen, Guanduo and Chen, Jiahao and Chen, Ningxin and Chen, Ruijue and Chen et al., Yanru , year = 2025, note =. Kimi

work page 2025

[32] [32]

, year = 2004, type =

Tropp, Joel A. , year = 2004, type =. Topics in

work page 2004

[33] [33]

Orthogonalising gradients to speed up neural network optimisation , author =

work page

[34] [34]

Implicit

Xie, Shuo and Li, Zhiyuan , year = 2024, note =. Implicit

work page 2024

[35] [35]

An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jianxin Yang and Jin Xu and Jingren Zhou and Jinze Bai and Jinzh...

work page 2024

[36] [36]

and Bernstein, Jeremy , year = 2024, note =

Yang, Greg and Simon, James B. and Bernstein, Jeremy , year = 2024, note =. A

work page 2024

[37] [37]

Spectral

Yoshida, Yuichi and Miyato, Takeru , year = 2017, note =. Spectral

work page 2017