Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer
Pith reviewed 2026-05-25 02:54 UTC · model grok-4.3
The pith
Regularized Muon induces a damped Hamiltonian flow on probability measures over matrix parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The inertial Muon dynamics is a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential, and it satisfies an exact Hamiltonian dissipation identity showing that the Hamiltonian energy decreases monotonically.
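For a single matrix parameter, the dissipation mechanism can be sketched as follows. This is an illustration under assumptions, not the paper's measure-valued statement: $f$ stands in for the objective, $\Phi$ for a differentiable convex kinetic potential minimized at $0$, and $\gamma > 0$ for a damping rate. All three symbols are introduced here and do not appear on this page.

```latex
\begin{aligned}
H(X,P) &= f(X) + \Phi(P), \\
\dot X &= \nabla \Phi(P), \qquad \dot P = -\nabla f(X) - \gamma P, \\
\frac{d}{dt} H(X_t,P_t)
  &= \langle \nabla f(X_t), \dot X_t \rangle + \langle \nabla \Phi(P_t), \dot P_t \rangle \\
  &= -\gamma \,\langle \nabla \Phi(P_t), P_t \rangle
  \;\le\; -\gamma \,\bigl(\Phi(P_t) - \Phi(0)\bigr) \;\le\; 0 .
\end{aligned}
```

The cross terms $\pm\langle \nabla f, \nabla \Phi \rangle$ cancel, so only the damping term survives. The inequality uses convexity of $\Phi$ and the assumption that $\Phi$ is minimized at $0$. Nothing in this calculation forces $f$ alone to decrease, which is consistent with the abstract's remark that the target objective need not be monotone along the flow.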
What carries the argument
The regularized orthogonalization map as the gradient of a smooth Fenchel-dual smoothing of the nuclear norm, which makes the Muon update a mirror/prox step with momentum as the dual variable.
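The following is a minimal numerical sketch of what such an identification looks like. The paper's exact smoothing functional is not reproduced on this page, so the choice below is an assumption: the candidate potential is `smoothed_nuclear_norm(P) = sum_i sqrt(sigma_i(P)^2 + eps^2)`, and its gradient `regularized_orth(P) = U diag(sigma_i / sqrt(sigma_i^2 + eps^2)) V^T` plays the role of the regularized orthogonalization map. The function names and the value of `EPS` are hypothetical. The script compares a finite-difference directional derivative with the closed-form gradient.

```python
import numpy as np

EPS = 0.1  # hypothetical smoothing parameter, not a value taken from the paper


def smoothed_nuclear_norm(P, eps=EPS):
    """Assumed smoothing of the nuclear norm: sum_i sqrt(sigma_i^2 + eps^2)."""
    s = np.linalg.svd(P, compute_uv=False)
    return float(np.sum(np.sqrt(s**2 + eps**2)))


def regularized_orth(P, eps=EPS):
    """Gradient of the assumed smoothing: U diag(s / sqrt(s^2 + eps^2)) V^T."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U * (s / np.sqrt(s**2 + eps**2))) @ Vt


rng = np.random.default_rng(0)
P = rng.standard_normal((4, 3))  # test point
D = rng.standard_normal((4, 3))  # test direction

# Central finite difference of the potential along D.
t = 1e-6
fd = (smoothed_nuclear_norm(P + t * D) - smoothed_nuclear_norm(P - t * D)) / (2 * t)

# Closed-form directional derivative <grad Phi(P), D>.
exact = float(np.sum(regularized_orth(P) * D))

grad_error = abs(fd - exact)
max_singular_value = float(np.linalg.svd(regularized_orth(P), compute_uv=False).max())

print("finite-difference vs closed-form gap:", grad_error)
print("largest singular value of the map:", max_singular_value)
```

Under this assumed smoothing, every singular value of the output is strictly below 1, and letting `eps` shrink toward zero recovers the orthogonal polar factor `U V^T` for a full-rank input, which is the idealized orthogonalization step.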
If this is right
- Exponential convergence rates hold for both the continuous-time flow and its discrete-time discretization under the stated extra assumptions.
- The mean-field limit equation is well-posed.
- The interacting particle system satisfies propagation of chaos.
- The formulation extends to Hilbert-valued feature maps, giving a blockwise Muon flow for mixture-of-experts transformers.
Where Pith is reading between the lines
- The Hamiltonian structure may permit symplectic or structure-preserving discretizations that better preserve the dissipation identity than standard momentum methods.
- The probability-measure view suggests direct application to ensemble or mean-field training of neural networks without passing through single-particle limits first.
- Testing the dissipation identity on small-scale transformer layers could reveal whether the monotonicity survives the practical approximations used in Muon implementations.
Load-bearing premise
Exponential convergence of the objective gap requires extra gradient-dominance, bounded-momentum, and curvature/alignment conditions on the objective and the flow.
What would settle it
Numerical integration of the inertial Muon dynamics on a simple quadratic objective that checks whether the Hamiltonian energy decreases monotonically at every step.
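A minimal sketch of such a check for a single matrix parameter is given below. Everything specific in it is an assumption rather than the paper's setup: the objective is `f(X) = 0.5 * ||X||_F^2`, the kinetic potential is the smoothing `sum_i sqrt(sigma_i^2 + eps^2)`, the dynamics are `X' = grad_phi(P)`, `P' = -grad f(X) - gamma * P` (so `P` is the negative of the usual momentum buffer), and the integrator is a simple splitting scheme chosen for brevity. The names `phi`, `grad_phi`, and `simulate` and all parameter values are hypothetical.

```python
import numpy as np


def phi(P, eps):
    """Assumed kinetic potential: sum_i sqrt(sigma_i(P)^2 + eps^2)."""
    s = np.linalg.svd(P, compute_uv=False)
    return float(np.sum(np.sqrt(s**2 + eps**2)))


def grad_phi(P, eps):
    """Gradient of phi: U diag(s / sqrt(s^2 + eps^2)) V^T."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U * (s / np.sqrt(s**2 + eps**2))) @ Vt


def simulate(X0, eps=0.1, gamma=1.0, h=0.01, steps=2000):
    """Integrate the damped dynamics for f(X) = 0.5 * ||X||_F^2 and record H = f + phi."""
    X, P = X0.copy(), np.zeros_like(X0)
    energies = [0.5 * np.sum(X**2) + phi(P, eps)]
    for _ in range(steps):
        P = np.exp(-gamma * h) * P - h * X   # damping, then force; grad f(X) = X
        X = X + h * grad_phi(P, eps)         # position update with the new momentum
        energies.append(0.5 * np.sum(X**2) + phi(P, eps))
    return np.array(energies)


X0 = np.random.default_rng(0).standard_normal((4, 3))
energies = simulate(X0)
increments = np.diff(energies)

print("initial energy:", energies[0])
print("final energy:", energies[-1])
print("largest single-step change:", increments.max())
print("steps with an energy increase:", int(np.sum(increments > 0)))
```

The continuous-time identity does not guarantee step-by-step monotonicity for a discretization, so the script reports the number of increasing steps instead of asserting that there are none. Any increases that shrink as `h` is reduced would point to discretization error and not to a failure of the identity itself.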
Original abstract
We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(\rho)=R\left(\int F d \rho\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a gradient flow on probability measures over matrix-valued parameters induced by a regularized version of the Muon optimizer. It identifies the regularized orthogonalization map as the gradient of a smooth Fenchel-dual smoothing of the nuclear norm, lifts the dynamics to finite-particle probability objectives of the form J(ρ)=R(∫F dρ), derives the inertial continuous-time limit, and passes to a phase-space mean-field equation. The resulting flow is shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential, with an exact Hamiltonian dissipation identity proving monotonic decrease of the Hamiltonian energy. Under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, exponential convergence rates are obtained for both continuous and discrete time; well-posedness of the mean-field equation and propagation of chaos are established, with an extension to Hilbert-valued feature maps for blockwise Muon flows in transformer MoE models.
Significance. If the derivations hold, the exact dissipation identity provides a structurally clean Hamiltonian perspective on inertial Muon dynamics that is stronger than generic energy dissipation in gradient flows. The mean-field lifting and propagation-of-chaos results are technically substantive for analyzing large-scale neural network training, and the blockwise extension directly targets practical MoE architectures. The work supplies a concrete bridge between mirror-descent interpretations of orthogonalization and probability-gradient-flow theory.
major comments (2)
- [Abstract] Abstract (key observation paragraph): The identification of the regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm is load-bearing for the entire Hamiltonian structure. It is presented as an observation; the manuscript must supply the explicit smoothing functional, the value of the regularization parameter, and the direct verification that its gradient recovers the Muon update, to rule out the possibility that the parameter is chosen post-hoc to enforce the desired mirror-map property.
- [Abstract] Abstract (convergence paragraph): The exponential convergence rates are stated to hold only under gradient-dominance, bounded-momentum, and curvature/alignment assumptions. These assumptions are not shown to be satisfied by standard neural-network losses or by the Muon dynamics itself; the manuscript should either derive verifiable conditions under which they hold or provide a counter-example illustrating when the rates fail.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our work. We address each major comment below and will incorporate revisions where appropriate to strengthen the manuscript.
- Referee: [Abstract] Abstract (key observation paragraph): The identification of the regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm is load-bearing for the entire Hamiltonian structure. It is presented as an observation; the manuscript must supply the explicit smoothing functional, the value of the regularization parameter, and the direct verification that its gradient recovers the Muon update, to rule out the possibility that the parameter is chosen post-hoc to enforce the desired mirror-map property.
  Authors: We agree that the explicit construction is essential for rigor. In the revised version we will state the smoothing functional explicitly as the Fenchel dual of the nuclear norm plus an ε-quadratic regularizer (with ε the regularization parameter), and include the direct gradient computation verifying that it recovers the regularized orthogonalization map used by Muon. This material will be added to the abstract and expanded in Section 2.
  Revision: yes
- Referee: [Abstract] Abstract (convergence paragraph): The exponential convergence rates are stated to hold only under gradient-dominance, bounded-momentum, and curvature/alignment assumptions. These assumptions are not shown to be satisfied by standard neural-network losses or by the Muon dynamics itself; the manuscript should either derive verifiable conditions under which they hold or provide a counter-example illustrating when the rates fail.
  Authors: We accept that the assumptions require further discussion. The revision will add a remark deriving verifiable conditions (e.g., Polyak-Łojasiewicz inequality on the loss, which holds in certain overparameterized regimes) under which the rates apply, together with a simple counter-example where gradient dominance fails and only sublinear convergence is recovered. These additions will appear in the convergence analysis section.
  Revision: yes
Circularity Check
No significant circularity; derivation self-contained via independent mathematical identification
The paper's core chain begins with the stated key observation that the regularized orthogonalization map equals the gradient of a Fenchel-dual smoothing of the nuclear norm. This is used to interpret the update as a mirror step and lift to the probability-measure objective J(ρ), yielding the inertial continuous-time limit and phase-space mean-field equation. The damped Hamiltonian structure and exact dissipation identity are then derived directly from this mirror potential. No equation is shown to equal its input by construction, no parameter is fitted then renamed as prediction, and no load-bearing premise rests on a self-citation chain. The additional convergence assumptions are isolated and do not affect the dissipation identity. The derivation therefore remains independent of its target outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- smoothing/regularization parameter
axioms (1)
- domain assumption: The regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm.
invented entities (1)
- damped Hamiltonian probability dynamics on parameter-momentum pairs (no independent evidence)