
arxiv: 2605.26459 · v1 · pith:5U6BHF67new · submitted 2026-05-26 · 💻 cs.LG

MuCon: Clipped Muon Updates for LLM Training

Pith reviewed 2026-06-29 19:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords Muon optimizer · clipped updates · spectral norm projection · LLM training · polar decomposition · Newton iteration · matrix clipping

The pith

MuCon clips the singular values of Muon update matrices to enforce a spectral-norm bound while preserving the projection property.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines MuCon as the variant of Muon that replaces the polar factor with a singular-value-clipped matrix whose largest singular value is at most tau. It establishes that this clipping map is exactly the Frobenius projection onto the spectral-norm ball. Two exact identities are recorded that let the clipped factor be obtained from a polar decomposition or from a rational Newton iteration on the positive-semidefinite part, avoiding a full dense SVD. The identities themselves are exact; what degrades is their numerical evaluation, which becomes ill-conditioned when singular values lie near the clipping threshold unless those values are handled by a stable primitive or explicit regularization. Under that condition, the work supplies a practical route to Muon-style updates whose spectral norm is explicitly controlled.
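The summary above names the two identities without writing them out. The block below gives one standard pair that is consistent with the description; it is an assumption made for illustration, not a reproduction of the paper's formulas. It starts from the scalar identity for the minimum and lifts it through a polar decomposition B = QH with H = (B^T B)^{1/2}.

```latex
% Polar/absolute-value form (assumed; B = Q H, H symmetric positive semidefinite)
\min\{\sigma,\tau\} = \tfrac12\bigl(\sigma + \tau - |\sigma - \tau|\bigr)
\;\Longrightarrow\;
\operatorname{MClip}_\tau(B) = \tfrac12\, Q\,\bigl(H + \tau I - |H - \tau I|\bigr),
\qquad |A| = (A^2)^{1/2} = \operatorname{sign}(A)\,A .

% Scalar-root form (assumed): |\sigma - \tau| is the positive root of y^2 = a,
% a = (\sigma - \tau)^2, and Newton's method for that root is a rational map
y_{k+1} = \tfrac12\bigl(y_k + a\,y_k^{-1}\bigr), \qquad y_k \to |\sigma - \tau| .
```

With B = U diag(sigma_i) V^T, Q = U V^T and H = V diag(sigma_i) V^T, the bracket equals 2 V diag(min{sigma_i, tau}) V^T, so the first line reproduces the SVD definition. In the second line, a tends to zero as sigma approaches tau; Newton's convergence then slows and each step divides by a small y_k, which matches the obstruction the summary describes.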

Core claim

MuCon replaces the Muon polar direction Pol(B) with the clipped direction MClip_tau(B) = U diag(min{sigma_i, tau}) V^T. This operator is the nearest-point projection, in the Frobenius norm, onto the set of matrices whose spectral norm is at most tau. The clipping step admits an exact polar/absolute-value representation and an exact scalar-root representation that reduces to a rational Newton filter; both representations become numerically ill-conditioned precisely when singular values lie near the threshold tau.
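As a point of reference, the definition of MClip_tau can be written directly with a dense SVD. This is a minimal NumPy sketch of the baseline that the paper wants to avoid, not the paper's proposed method; the function name `mclip_svd` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def mclip_svd(B, tau):
    """MClip_tau(B) = U diag(min(sigma_i, tau)) V^T, computed with a thin SVD."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Scale column i of U by min(sigma_i, tau), then recombine with V^T.
    return (U * np.minimum(s, tau)) @ Vt

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))   # hypothetical stand-in for a Muon matrix
tau = 1.5
D = mclip_svd(B, tau)

s_in = np.linalg.svd(B, compute_uv=False)    # singular values before clipping
s_out = np.linalg.svd(D, compute_uv=False)   # singular values after clipping
```

The singular values of `D` should equal `np.minimum(s_in, tau)`, and applying `mclip_svd` to `D` again should leave it unchanged, as expected of a projection.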

What carries the argument

The MClip_tau operator, the Frobenius projection onto the spectral-norm ball that clips only the singular values exceeding tau.
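The projection property can be checked in two lines. The sketch below uses Mirsky's inequality for singular values, a standard result; it is offered as one way to see the claim and is not taken from the paper.

```latex
% For any X with \|X\|_2 \le \tau, i.e. 0 \le \sigma_i(X) \le \tau for all i:
\|B - X\|_F^2
  \;\ge\; \sum_i \bigl(\sigma_i(B) - \sigma_i(X)\bigr)^2          % Mirsky
  \;\ge\; \sum_i \bigl(\sigma_i(B) - \min\{\sigma_i(B),\tau\}\bigr)^2
  \;=\; \|B - \operatorname{MClip}_\tau(B)\|_F^2 .
```

The second inequality holds because min{sigma_i(B), tau} is the closest point to sigma_i(B) in the interval [0, tau]. Since the spectral-norm ball is closed and convex, the minimizer is unique, so MClip_tau(B) is the Frobenius projection.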

If this is right

  • The SpectralP scaling parameterization remains valid when the clipped rather than the polar direction is substituted.
  • Matrix-function approximations to MuCon become usable once stable polar or square-root primitives are paired with them.
  • Only the singular directions that violate the norm bound are altered; all other directions stay unchanged.
  • The method supplies an explicit spectral-norm control that standard Muon does not possess.
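The contrast in the last two bullets can be made concrete on a small matrix with prescribed singular values. This is an illustrative sketch; the sizes, the values in `sigma`, and the choice `tau = 1.0` are all assumptions. It measures the component of each update along the singular pairs (u_i, v_i) of B.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical test matrix built from chosen singular vectors and values.
U, _ = np.linalg.qr(rng.standard_normal((5, 4)))   # 5x4, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # 4x4, orthogonal
sigma = np.array([3.0, 2.0, 0.5, 0.1])
tau = 1.0

B = (U * sigma) @ V.T
muon = U @ V.T                                 # Pol(B): every sigma_i becomes 1
mucon = (U * np.minimum(sigma, tau)) @ V.T     # MClip_tau(B)

# Coefficient u_i^T D v_i of each update D along the i-th singular pair of B.
coeff_muon = np.einsum("ij,ij->j", U, muon @ V)
coeff_mucon = np.einsum("ij,ij->j", U, mucon @ V)
```

Here `coeff_muon` is all ones, while `coeff_mucon` is approximately `[1.0, 1.0, 0.5, 0.1]`: the two directions above the threshold are capped at tau and the two below it keep their original magnitude.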

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clipping construction could be applied to other matrix-valued momentum or preconditioner updates that rely on polar factors.
  • Explicit regularization near the threshold may turn the numerical obstruction into a tunable hyper-parameter rather than a hard limitation.
  • If the rational Newton filter proves stable under the regularization, MuCon-style updates could be inserted into existing large-scale training code with only local changes to the update step.
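One way to picture regularization as a tunable parameter is to smooth the absolute value in min{sigma, tau} = (sigma + tau - |sigma - tau|)/2. The sketch below replaces |x| with sqrt(x^2 + eps^2); this particular smoothing, the name `smooth_clip`, and the parameter `eps` are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def smooth_clip(sigma, tau, eps):
    """Smoothed surrogate for min(sigma, tau): |x| is replaced by sqrt(x^2 + eps^2)."""
    x = sigma - tau
    return 0.5 * (sigma + tau - np.sqrt(x * x + eps * eps))

sigma = np.linspace(0.0, 3.0, 301)   # grid of singular values, includes sigma = tau
tau, eps = 1.0, 1e-2
exact = np.minimum(sigma, tau)
approx = smooth_clip(sigma, tau, eps)

# Largest deviation from exact clipping; attained where sigma is closest to tau.
max_err = np.max(np.abs(approx - exact))
```

Because sqrt(x^2 + eps^2) lies between |x| and |x| + eps, the surrogate never exceeds the exact clipped value and differs from it by at most eps/2, so the spectral-norm bound tau is preserved. The surrogate dips slightly below zero for sigma near zero, which a real implementation would need to decide how to handle.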

Load-bearing premise

Singular values near the clipping threshold can be handled stably by existing polar or square-root primitives or by explicit regularization, without breaking the approximation on the matrices that appear in LLM training.

What would settle it

A direct numerical comparison, on weight-update matrices drawn from an actual LLM training run, showing that the polar or Newton approximation deviates from the exact clipped SVD by more than a small multiple of machine epsilon when any singular value lies within a small relative distance of tau.
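The shape of such a comparison can be sketched on synthetic data. The code below is not the experiment described above: it uses a hypothetical 4x4 matrix rather than real training updates, and it assumes an SVD-free route built from the classical Newton iterations for the polar factor and the matrix sign function, via MClip_tau(B) = (1/2) Q (H + tau I - |H - tau I|) with B = QH. All function names are illustrative.

```python
import numpy as np

def mclip_svd(B, tau):
    """Reference: exact clipping through a thin SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.minimum(s, tau)) @ Vt

def newton_polar(B, iters=20):
    """Orthogonal polar factor of a square nonsingular B (Newton iteration)."""
    X = B.copy()
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X).T)
    return X

def newton_sign(S, iters):
    """Matrix sign of S by Newton's iteration; requires S nonsingular."""
    X = S.copy()
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))
    return X

def mclip_newton(B, tau, sign_iters):
    """SVD-free clipping: 0.5 * Q (H + tau I - |H - tau I|), with |S| = sign(S) S."""
    n = B.shape[1]
    Q = newton_polar(B)
    H = Q.T @ B
    H = 0.5 * (H + H.T)                       # remove roundoff asymmetry
    S = H - tau * np.eye(n)
    abs_S = newton_sign(S, sign_iters) @ S
    return 0.5 * Q @ (H + tau * np.eye(n) - abs_S)

def synthetic_update(gap, tau=1.0, seed=0):
    """Hypothetical update matrix with one singular value at tau + gap."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    sigma = np.array([2.0, 1.5, tau + gap, 0.3])
    return (U * sigma) @ V.T

def clip_error(gap, tau=1.0, sign_iters=12):
    """Frobenius distance between the Newton-based and SVD-based results."""
    B = synthetic_update(gap, tau)
    return np.linalg.norm(mclip_newton(B, tau, sign_iters) - mclip_svd(B, tau))

err_far = clip_error(gap=0.5)     # all singular values well away from tau
err_near = clip_error(gap=1e-8)   # one singular value almost on the threshold
print(f"gap=0.5: {err_far:.2e}   gap=1e-8: {err_near:.2e}")
```

With a fixed budget of 12 sign iterations, the first error should sit near roundoff while the second is many orders of magnitude larger, because the sign iteration has not converged on the eigenvalue of H - tau I that is close to zero. A larger `sign_iters` should reduce the second error, but the iteration then starts by inverting a nearly singular matrix, which is the conditioning issue the abstract describes.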

Original abstract

Muon-style optimizers take a matrix-valued momentum or preconditioned update $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$ and replace it with its canonical partial polar factor $\operatorname{Pol}(B) = U V^\top$. This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix, $D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}\bigl(\min\{\sigma_i,\tau\}\bigr) V^\top, \qquad \tau > 0$. Thus, $\operatorname{MClip}_\tau$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon's polar direction. The Muon/MuCon scaling parameterization used in this work is called $\text{SpectralP}$: it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map $\operatorname{MClip}_\tau$ is the Frobenius projection onto the spectral-norm ball $\{X : \|X\|_2 \le \tau\}$: it leaves singular values at or below $\tau$ unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper defines MuCon as a clipped-Muon optimizer that applies singular-value clipping to the Muon matrix B using the operator MClip_τ(B) = U diag(min{σ_i, τ}) V^T. It claims that MClip_τ is the Frobenius projection onto the spectral-norm ball and records two exact identities for approximating this operator without a full dense SVD: a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter. The paper identifies the numerical obstruction that singular values near the threshold τ render sign decisions and rational solves ill-conditioned, and concludes that matrix-function methods are useful only when paired with stable polar/square-root primitives or explicit regularization.

Significance. If the two exact identities hold as stated, the work provides parameter-free formulas for approximating the clipping step, which is a notable strength for potential use in efficient LLM training optimizers. The explicit identification of the shared numerical obstruction is also a constructive contribution that clarifies the conditions under which the approximations can be applied.

major comments (1)
  1. [Abstract] The manuscript states two exact identities (a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter) but supplies neither derivations nor verification, although both identities are load-bearing for the central mathematical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the requested changes in the revision.

point-by-point responses
  1. Referee: [Abstract] The manuscript states two exact identities (a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter) but supplies neither derivations nor verification, although both identities are load-bearing for the central mathematical claims.

    Authors: We agree that the derivations and verification of the two exact identities are essential and currently insufficiently detailed. In the revised manuscript we will add complete derivations of the polar/absolute-value identity and the scalar-root formulation leading to the rational Newton filter, together with explicit numerical verification of both identities (including edge cases near the clipping threshold). These additions will be placed in the main text or a dedicated appendix to make the central claims self-contained.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines MClip_τ directly as the SVD-based clipping operator and states its projection property onto the spectral-norm ball as a direct consequence of that definition. It records exact mathematical identities for approximation (polar/absolute-value and scalar-root to Newton filter) without any fitted parameters, self-citation chains, or reduction of predictions to inputs. The numerical obstruction is explicitly scoped as a limitation rather than assumed away. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no free parameters, axioms, or invented entities that can be extracted; τ is introduced as a positive scalar hyperparameter but is not fitted or derived within the provided text.


