MuCon: Clipped Muon Updates for LLM Training

Albert Yi

arxiv: 2605.26459 · v1 · pith:5U6BHF67new · submitted 2026-05-26 · 💻 cs.LG

Pith reviewed 2026-06-29 19:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords Muon optimizer · clipped updates · spectral norm projection · LLM training · polar decomposition · Newton iteration · matrix clipping

The pith

MuCon clips the singular values of Muon update matrices to enforce a spectral-norm bound while preserving the projection property.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines MuCon as the variant of Muon that replaces the polar factor with a singular-value-clipped matrix whose largest singular value is at most tau. It establishes that this clipping map is exactly the Frobenius projection onto the spectral-norm ball. Two exact identities are recorded that let the clipped factor be obtained from a polar decomposition or from a rational Newton iteration on the positive-semidefinite part, avoiding a full dense SVD. The identities themselves are exact; what fails near the clipping threshold is their numerical evaluation, so singular values close to tau must be handled by stable primitives or explicit regularization. The work therefore supplies a practical route to Muon-style updates whose matrix norm is explicitly controlled.

Core claim

MuCon replaces the Muon polar direction Pol(B) with the clipped direction MClip_tau(B) = U diag(min{sigma_i, tau}) V^T. This operator is the orthogonal projection (in Frobenius norm) onto the set of matrices whose spectral norm is at most tau. The clipping step admits an exact polar/absolute-value representation and an exact scalar-root representation that reduces to a rational Newton filter; both representations become numerically ill-conditioned precisely when singular values lie near the threshold tau.
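As a point of reference, the two directions can be written directly with a dense SVD. The sketch below is illustrative only: the function names `pol` and `mclip`, the rank cutoff, and the random test matrix are assumptions of this example rather than code from the paper, and the dense SVD it uses is exactly the cost the paper tries to avoid.

```python
import numpy as np

def pol(B):
    """Muon's polar direction U V^T: every nonzero singular value is mapped to one."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    keep = s > 1e-12 * max(s.max(), 1.0)      # hypothetical cutoff for "nonzero"
    return U[:, keep] @ Vt[keep, :]

def mclip(B, tau):
    """Reference MClip_tau(B) = U diag(min(sigma_i, tau)) V^T via a dense SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.minimum(s, tau)) @ Vt      # scale columns of U, then recombine

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))               # stand-in for a momentum matrix
tau = 2.0
D = mclip(B, tau)

sigma_B = np.linalg.svd(B, compute_uv=False)  # singular values before clipping
sigma_D = np.linalg.svd(D, compute_uv=False)  # singular values after clipping
sigma_P = np.linalg.svd(pol(B), compute_uv=False)
```

When every singular value of B is at least tau, the clipped direction reduces to tau times the polar direction, so Muon's update is recovered as a special case up to scale.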

What carries the argument

The MClip_tau operator, the Frobenius projection onto the spectral-norm ball that clips only the singular values exceeding tau.
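The projection property can be checked in a few lines. The argument below is a standard one, sketched here for convenience and not taken from the paper; it assumes Mirsky's inequality for singular values and writes $\sigma_i(\cdot)$ for singular values sorted in decreasing order.

```latex
% Claim: MClip_tau(B) minimizes ||B - X||_F over the ball { X : ||X||_2 <= tau }.
\begin{align*}
\|B - X\|_F^2
  &\ge \sum_i \bigl(\sigma_i(B) - \sigma_i(X)\bigr)^2
  && \text{(Mirsky's inequality)} \\
  &\ge \sum_i \min_{0 \le s \le \tau} \bigl(\sigma_i(B) - s\bigr)^2
  && \text{(}\|X\|_2 \le \tau \text{ forces } 0 \le \sigma_i(X) \le \tau\text{)} \\
  &= \sum_i \bigl(\sigma_i(B) - \min\{\sigma_i(B), \tau\}\bigr)^2
   = \bigl\|B - \operatorname{MClip}_\tau(B)\bigr\|_F^2 .
\end{align*}
% The ball is closed and convex, so the minimizer is unique.
```

Both inequalities are attained by $X = \operatorname{MClip}_\tau(B)$, which shares the singular vectors of $B$ and has spectral norm at most $\tau$.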

If this is right

The SpectralP scaling parameterization remains valid when the clipped rather than the polar direction is substituted.
Matrix-function approximations to MuCon become usable once stable polar or square-root primitives are paired with them.
Only the singular directions that violate the norm bound are altered; all other directions stay unchanged.
The method supplies an explicit spectral-norm control that standard Muon does not possess.
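The claim that only violating directions are altered can be made concrete: the part removed by clipping, B - MClip_tau(B), has rank equal to the number of singular values above tau. The snippet below is a small check on a synthetic matrix with hand-picked singular values (an assumption of this example, not data from the paper).

```python
import numpy as np

def mclip(B, tau):
    """Dense-SVD reference for U diag(min(sigma_i, tau)) V^T."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.minimum(s, tau)) @ Vt

# Synthetic matrix with known singular vectors and singular values.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sigma = np.array([3.0, 2.5, 0.9, 0.4, 0.1])
B = (U * sigma) @ V.T
tau = 1.0

R = B - mclip(B, tau)                        # what clipping removes
n_violating = int(np.sum(sigma > tau))       # singular values above tau
rank_R = int(np.linalg.matrix_rank(R, tol=1e-9))

# Right singular directions with sigma_i <= tau pass through unchanged.
unchanged = np.allclose(mclip(B, tau) @ V[:, 2:], B @ V[:, 2:])
```

Here the removed part has rank two, matching the two singular values (3.0 and 2.5) that exceed tau = 1.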

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clipping construction could be applied to other matrix-valued momentum or preconditioner updates that rely on polar factors.
Explicit regularization near the threshold may turn the numerical obstruction into a tunable hyper-parameter rather than a hard limitation.
If the rational Newton filter proves stable under the regularization, MuCon-style updates could be inserted into existing large-scale training code with only local changes to the update step.
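One way such a regularization could look, purely as an assumption of this note and not a scheme proposed in the paper, is to smooth the absolute value in the scalar identity min(sigma, tau) = (sigma + tau - |sigma - tau|) / 2 by replacing |d| with sqrt(d^2 + eps^2). The hypothetical parameter `eps` then trades accuracy at the threshold for conditioning.

```python
import numpy as np

def smooth_clip(sigma, tau, eps):
    """Smoothed min(sigma, tau): |sigma - tau| is replaced by sqrt((sigma - tau)^2 + eps^2)."""
    d = sigma - tau
    return 0.5 * (sigma + tau - np.sqrt(d * d + eps * eps))

tau, eps = 1.0, 1e-2
sigma = np.linspace(0.0, 3.0, 3001)          # scalar singular values to probe
exact = np.minimum(sigma, tau)
approx = smooth_clip(sigma, tau, eps)

# The gap is (sqrt(d^2 + eps^2) - |d|) / 2, largest at sigma = tau where it is eps / 2.
max_err = float(np.max(np.abs(exact - approx)))
```

Because sqrt(d^2 + eps^2) is never smaller than |d|, the smoothed value never exceeds min(sigma, tau), so the spectral-norm bound is preserved; the price is a small downward bias (slightly negative at sigma = 0) that a practical scheme would have to account for.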

Load-bearing premise

Singular values near the clipping threshold can be handled stably by existing polar or square-root primitives or by explicit regularization, without breaking the approximation on the matrices that appear in LLM training.

What would settle it

A direct numerical comparison, on weight-update matrices drawn from an actual LLM training run, showing that the polar or Newton approximation deviates from the exact clipped SVD by more than machine epsilon when any singular value lies within a small relative distance of tau.
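A toy version of that comparison, on a synthetic symmetric matrix and not on real training updates, might look like the sketch below. It uses the classical Newton iteration for the matrix sign function as a stand-in rational iteration; whether this coincides with the paper's rational Newton filter is not stated in the abstract, and all names here are hypothetical.

```python
import numpy as np

def sign_newton(A, tol=1e-12, max_iter=200):
    """Newton iteration S <- (S + S^{-1}) / 2 for the matrix sign of a nonsingular A.
    Returns the final iterate and the number of iterations used."""
    S = A.copy()
    for k in range(1, max_iter + 1):
        S_next = 0.5 * (S + np.linalg.inv(S))
        if np.linalg.norm(S_next - S) <= tol * np.linalg.norm(S_next):
            return S_next, k
        S = S_next
    return S, max_iter

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))    # synthetic orthogonal basis
tau, I = 1.0, np.eye(4)
iters, errs = {}, {}
for gap in (1e-1, 1e-4, 1e-8):
    # One eigenvalue sits at relative distance `gap` above the threshold.
    eigs = np.array([2.0, 1.5, tau * (1.0 + gap), 0.3])
    H = (Q * eigs) @ Q.T                            # symmetric PSD test matrix
    exact = (Q * np.minimum(eigs, tau)) @ Q.T       # clipping done on known eigenvalues
    A = H - tau * I
    S, iters[gap] = sign_newton(A)
    approx = 0.5 * (H + tau * I - A @ S)            # |A| = A sign(A)
    errs[gap] = float(np.linalg.norm(approx - exact))
```

In this toy setting the iteration count grows roughly like log2(1/gap) as the eigenvalue approaches tau, which is one visible symptom of the obstruction. A real test would replace the synthetic `H` with matrices captured from a training run and would also track the deviation `errs`.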

Original abstract

Muon-style optimizers take a matrix-valued momentum or preconditioned update $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$ and replace it with its canonical partial polar factor $\operatorname{Pol}(B) = U V^\top$. This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix, $D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}\bigl(\min\{\sigma_i,\tau\}\bigr) V^\top$ with $\tau > 0$. Thus, $\operatorname{MClip}_\tau$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon's polar direction. The Muon/MuCon scaling parameterization used in this work is called $\text{SpectralP}$: it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map $\operatorname{MClip}_\tau$ is the Frobenius projection onto the spectral-norm ball $\{X : \|X\|_2 \le \tau\}$: it leaves singular values at or below $\tau$ unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.
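The abstract does not spell out the polar/absolute-value formula. One identity consistent with that description, offered here as an assumption and not as the paper's statement, follows from min(sigma, tau) = (sigma + tau - |sigma - tau|) / 2: if B = Q H is the polar decomposition with H symmetric positive semidefinite, then MClip_tau(B) = Q (H + tau I - |H - tau I|) / 2. The sketch below evaluates it for a square nonsingular B without an SVD inside the clipping routine, using the Newton polar iteration and the Newton sign iteration; the function names and fixed iteration counts are hypothetical.

```python
import numpy as np

def polar_newton(B, iters=60):
    """Orthogonal polar factor of a square nonsingular B via Q <- (Q + Q^{-T}) / 2."""
    Q = B.copy()
    for _ in range(iters):
        Q = 0.5 * (Q + np.linalg.inv(Q).T)
    return Q

def sign_newton(A, iters=60):
    """Matrix sign of a nonsingular A via S <- (S + S^{-1}) / 2."""
    S = A.copy()
    for _ in range(iters):
        S = 0.5 * (S + np.linalg.inv(S))
    return S

def mclip_polar(B, tau):
    """Clip via Q (H + tau I - |H - tau I|) / 2, assuming no singular value is near tau."""
    n = B.shape[0]
    Q = polar_newton(B)
    H = Q.T @ B
    H = 0.5 * (H + H.T)                  # symmetrize away rounding error
    A = H - tau * np.eye(n)
    abs_A = A @ sign_newton(A)           # |A| = A sign(A) for symmetric A
    return Q @ (0.5 * (H + tau * np.eye(n) - abs_A))

def mclip_svd(B, tau):
    """Dense-SVD reference, used only to check the identity."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.minimum(s, tau)) @ Vt

# Synthetic test matrix with singular values well separated from tau = 1.
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = (U * np.array([3.0, 1.7, 0.6, 0.2])) @ V.T
tau = 1.0
err = float(np.linalg.norm(mclip_polar(B, tau) - mclip_svd(B, tau)))
```

The sign iteration inverts H - tau I and its iterates, which is where singular values close to tau hurt: that matrix becomes nearly singular, matching the ill-conditioning the abstract describes.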

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MuCon records two exact identities for approximating clipped Muon without full SVD and correctly flags the shared numerical instability near the threshold.

The main point is that this paper defines MuCon as the version of Muon that clips singular values at a threshold tau instead of mapping them all to one, shows that the operator is the Frobenius projection onto the spectral-norm ball, and supplies two exact identities for computing the clip without a dense SVD: a polar/absolute-value route and a scalar-root route that yields a rational Newton filter. It also states the common numerical problem that singular values near tau make sign decisions and the rational solves ill-conditioned.

The identities look new relative to the cited Muon work, and the projection property is a clean observation that follows directly from the SVD definition. The paper is straightforward about the limitation and does not overclaim stability, which keeps the argument consistent.

The soft spot is that the abstract supplies no derivations, no numerical verification, and no LLM-scale tests, so it is still unclear how often the approximations would be stable enough to replace a full SVD in practice even with the suggested regularization. That gap is real but not fatal given the scoped claim.

The work is aimed at people who already run matrix-valued optimizers on large models and care about cheap ways to enforce spectral-norm bounds on updates. A reader who wants the specific formulas, or who is building on Muon, could extract value from the identities once the derivations are checked. It is coherent enough on its own terms to deserve a serious referee who can verify the math and ask for at least one stability experiment.

I would send it to review rather than desk-reject.

Referee Report

1 major / 0 minor

Summary. The paper defines MuCon as a clipped-Muon optimizer that applies singular-value clipping to the Muon matrix B using the operator MClip_τ(B) = U diag(min{σ_i, τ}) V^T. It claims that MClip_τ is the Frobenius projection onto the spectral-norm ball and records two exact identities for approximating this operator without a full dense SVD: a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter. The paper identifies the numerical obstruction that singular values near the threshold τ render sign decisions and rational solves ill-conditioned, and concludes that matrix-function methods are useful only when paired with stable polar/square-root primitives or explicit regularization.

Significance. If the two exact identities hold as stated, the work provides parameter-free derivations for approximating the clipping step, which is a notable strength for potential use in efficient LLM training optimizers. The explicit identification of the shared numerical obstruction is also a constructive contribution that clarifies the conditions under which the approximations can be applied.

major comments (1)

[Abstract] The manuscript states two exact identities (a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter) but supplies neither derivations nor verification. Both are load-bearing for confirming the central mathematical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the requested changes in the revision.

Referee: [Abstract] The manuscript states two exact identities (a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter) but supplies neither derivations nor verification. Both are load-bearing for confirming the central mathematical claims.

Authors: We agree that the derivations and verification of the two exact identities are essential and that the current manuscript does not present them in sufficient detail. In the revised manuscript we will add complete derivations of the polar/absolute-value identity and the scalar-root formulation leading to the rational Newton filter, together with explicit numerical verification of both identities (including edge cases near the clipping threshold). These additions will be placed in the main text or a dedicated appendix to make the central claims self-contained.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper defines MClip_τ directly as the SVD-based clipping operator and states its projection property onto the spectral-norm ball as a direct consequence of that definition. It records exact mathematical identities for approximation (polar/absolute-value and scalar-root to Newton filter) without any fitted parameters, self-citation chains, or reduction of predictions to inputs. The numerical obstruction is explicitly scoped as a limitation rather than assumed away. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no free parameters, axioms, or invented entities that can be extracted; τ is introduced as a positive scalar hyperparameter but is not fitted or derived within the provided text.

Reference graph

Works this paper leans on

12 extracted references · 4 canonical work pages · 1 internal anchor

[1] Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618, 2025.

[2] Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, and Chongxuan Li. Spectral condition for $\mu$P under width-depth scaling. arXiv preprint arXiv:2603.00541v1, 2026.

[3] Greg Yang and Edward J. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In Proceedings of the International Conference on Machine Learning, 2021.

[4] Greg Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.

[5] Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023.

[6] Keller Jordan, Yuchen Jin, Vladimir Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Technical blog post, 2024.

[7] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[8] Nicholas J. Higham. Computing the polar decomposition-with applications. SIAM Journal on Scientific and Statistical Computing, 7(4):1160-1174, 1986.

[9] Nicholas J. Higham. Functions of Matrices: Theory and Computation. SIAM, 2008.

[10] Yuji Nakatsukasa and Zhaojun Bai. Optimizing Halley's iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications, 31(5):2700-2720, 2010.

[11] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.

[12] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.
