Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

Aratrika Mustafi; Bharath K. Sriperumbudur; Soumya Mukherjee

arXiv: 2605.23871 · v1 · pith:AT4VBOJYnew · submitted 2026-05-22 · stat.ML · cs.LG · math.ST · stat.TH

Pith reviewed 2026-05-25 02:54 UTC · model grok-4.3

classification: stat.ML · cs.LG · math.ST · stat.TH

keywords: Muon optimizer, gradient flow, Hamiltonian dynamics, mean-field limit, probability measures, mirror descent, nuclear norm, convergence rates

The pith

Regularized Muon induces a damped Hamiltonian flow on probability measures over matrix parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper lifts the regularized Muon optimizer to a gradient flow on the space of probability measures whose particles are matrix-valued parameters. The central step is recognizing that the regularized orthogonalization map equals the gradient of a smooth Fenchel-dual smoothing of the nuclear norm, turning the Muon update into a mirror/prox step. This structure produces an inertial continuous-time limit that becomes a phase-space mean-field equation over joint laws on parameters and momenta. The resulting dynamics is a damped Hamiltonian probability flow whose kinetic energy comes from the Muon mirror potential, and the flow obeys an exact dissipation identity that makes the Hamiltonian energy decrease monotonically.

Core claim

The inertial Muon dynamics is a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential, and it satisfies an exact Hamiltonian dissipation identity showing that the Hamiltonian energy decreases monotonically.
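For orientation, the mechanism behind a dissipation identity of this kind can be seen in a single-particle, finite-dimensional analogue. The system below is an assumed generic form, not the paper's equations: $f$ is a differentiable objective, $\psi$ is a convex differentiable kinetic potential with $\nabla\psi(0)=0$ (standing in for the regularized Muon mirror potential), and $\gamma>0$ is a damping constant. The paper's scalings and its mean-field version may differ.

```latex
\begin{aligned}
H(X,P) &= f(X) + \psi(P), \\
\dot X &= \nabla\psi(P), \qquad \dot P = -\nabla f(X) - \gamma P, \\
\frac{d}{dt} H(X_t,P_t)
  &= \langle \nabla f(X), \dot X \rangle + \langle \nabla\psi(P), \dot P \rangle \\
  &= \langle \nabla f(X), \nabla\psi(P) \rangle
     - \langle \nabla\psi(P), \nabla f(X) \rangle
     - \gamma \langle \nabla\psi(P), P \rangle \\
  &= -\gamma \langle \nabla\psi(P), P \rangle \;\le\; 0 .
\end{aligned}
```

The final inequality uses only monotonicity of the gradient of a convex function together with $\nabla\psi(0)=0$. In this sketch the total energy $H$ decreases even though $f$ alone need not, which is consistent with the abstract's remark that the objective is not necessarily monotone along the inertial dynamics.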

What carries the argument

The regularized orthogonalization map as the gradient of a smooth Fenchel-dual smoothing of the nuclear norm, which makes the Muon update a mirror/prox step with momentum as the dual variable.
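A small numerical check can make this identification concrete. The sketch below assumes one standard smoothing, the Moreau-Yosida (Huber-type) envelope of the nuclear norm with parameter `eps`; the paper's actual functional is not given in this summary and may differ. The function names `smoothed_nuclear_norm` and `regularized_orthogonalization` are hypothetical.

```python
import numpy as np

def smoothed_nuclear_norm(M, eps):
    """Assumed smoothing: sum_i h(sigma_i), with h(s) = s^2/(2 eps) if s <= eps, else s - eps/2."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(np.where(s <= eps, s**2 / (2 * eps), s - eps / 2)))

def regularized_orthogonalization(M, eps):
    """Gradient of the smoothing above: U diag(min(sigma_i / eps, 1)) V^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.minimum(s / eps, 1.0)) @ Vt  # scales the columns of U

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))
D = rng.standard_normal((4, 3))
eps, h = 0.5, 1e-6

# Central finite difference of the potential along the direction D ...
fd = (smoothed_nuclear_norm(M + h * D, eps)
      - smoothed_nuclear_norm(M - h * D, eps)) / (2 * h)
# ... compared with the inner product between the claimed gradient and D.
inner = float(np.sum(regularized_orthogonalization(M, eps) * D))
print("finite-difference mismatch:", abs(fd - inner))
```

Under this assumed smoothing, singular values above `eps` are mapped to 1 and smaller ones are scaled linearly, so the map approaches the idealized orthogonalization `U @ Vt` as `eps` shrinks.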

If this is right

Exponential convergence rates hold for both the continuous-time flow and its time discretization under the stated extra assumptions.
The mean-field limit equation is well-posed.
The interacting particle system satisfies propagation of chaos.
The formulation extends to Hilbert-valued feature maps, giving a blockwise Muon flow for mixture-of-experts transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Hamiltonian structure may permit symplectic or structure-preserving discretizations that better preserve the dissipation identity than standard momentum methods.
The probability-measure view suggests direct application to ensemble or mean-field training of neural networks without passing through single-particle limits first.
Testing the dissipation identity on small-scale transformer layers could reveal whether the monotonicity survives the practical approximations used in Muon implementations.

Load-bearing premise

Exponential convergence of the objective gap requires extra gradient-dominance, bounded-momentum, and curvature/alignment conditions on the objective and the flow.
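To see why a gradient-dominance condition is the natural route to exponential rates, consider the simplest case as an illustration: a plain Euclidean gradient flow $\dot x = -\nabla f(x)$ under a Polyak-Łojasiewicz inequality with constant $\mu > 0$. This is a textbook calculation, not the paper's argument; the inertial Muon setting is harder because the objective is not monotone along the flow, which is presumably where the bounded-momentum and curvature/alignment conditions enter.

```latex
\begin{aligned}
&\text{Assume } \tfrac12 \|\nabla f(x)\|^2 \ge \mu \,\bigl(f(x) - f^\star\bigr) \text{ for all } x. \\
&\frac{d}{dt}\bigl(f(x_t) - f^\star\bigr)
   = \langle \nabla f(x_t), \dot x_t \rangle
   = -\|\nabla f(x_t)\|^2
   \le -2\mu \,\bigl(f(x_t) - f^\star\bigr), \\
&\text{so by Gr\"onwall's inequality}\quad
   f(x_t) - f^\star \le e^{-2\mu t}\,\bigl(f(x_0) - f^\star\bigr).
\end{aligned}
```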

What would settle it

Numerical integration of the inertial Muon dynamics on a simple quadratic objective that checks whether the Hamiltonian energy decreases monotonically at every step.
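A minimal version of such a check might look like the sketch below. Everything specific in it is an assumption rather than the paper's setup: the single-particle dynamics `dX/dt = grad psi(P)`, `dP/dt = -grad f(X) - gamma * P`, the Huber-type spectral smoothing used for `psi`, the quadratic objective `f(X) = 0.5 * ||X - target||_F^2`, and the semi-implicit Euler scheme. All function names are hypothetical.

```python
import numpy as np

def kinetic(P, eps):
    """Assumed kinetic energy: Huber-type smoothing of the nuclear norm of P."""
    s = np.linalg.svd(P, compute_uv=False)
    return float(np.sum(np.where(s <= eps, s**2 / (2 * eps), s - eps / 2)))

def velocity(P, eps):
    """Gradient of the kinetic energy: U diag(min(sigma_i / eps, 1)) V^T."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U * np.minimum(s / eps, 1.0)) @ Vt

def hamiltonian(X, P, target, eps):
    """Potential (quadratic objective) plus kinetic energy."""
    return 0.5 * float(np.sum((X - target) ** 2)) + kinetic(P, eps)

def simulate(target, X0, eps=0.1, gamma=1.0, dt=0.01, steps=2000):
    """Semi-implicit Euler: damping treated implicitly, position updated with the new momentum."""
    X, P = X0.copy(), np.zeros_like(X0)
    energies = [hamiltonian(X, P, target, eps)]
    for _ in range(steps):
        grad = X - target                        # gradient of the quadratic objective
        P = (P - dt * grad) / (1.0 + gamma * dt)  # momentum step with implicit damping
        X = X + dt * velocity(P, eps)             # position step along grad psi(P)
        energies.append(hamiltonian(X, P, target, eps))
    return X, P, np.array(energies)

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3))
X, P, energies = simulate(target, np.zeros((4, 3)))
print("largest one-step energy change:", float(np.max(np.diff(energies))))
print("initial and final energy:", energies[0], energies[-1])
```

For this particular scheme and quadratic objective, a short convexity argument suggests the discrete energy is non-increasing whenever `dt <= 2 * gamma * eps`, which the default parameters satisfy. Other discretizations, such as fully explicit Euler, need not keep the energy monotone at every step, so a failed check would have to be separated from discretization error before it says anything about the continuous-time identity.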

Figures

Figures reproduced from arXiv: 2605.23871 by Aratrika Mustafi, Bharath K. Sriperumbudur, Soumya Mukherjee.

**Figure 2.** Experiment 1: matrix mean matching. Top: […]

**Figure 3.** Experiment 2: product-space teacher-student particles with […]

Original abstract

We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(\rho)=R\left(\int F d \rho\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.
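The finite-particle objective $J(\rho)=R\left(\int F d\rho\right)$ in the abstract can be made concrete for an empirical measure over N matrix particles. The sketch below uses a hypothetical feature map `F(X) = X X^T` and a hypothetical outer function `R(m) = 0.5 * ||m - C||_F^2`; neither is taken from the paper. It computes the per-particle driving gradient `DF(X_i)^*[grad R(m)]` by the chain rule and checks it against a finite difference.

```python
import numpy as np

def feature(X):
    """Hypothetical feature map F(X) = X X^T."""
    return X @ X.T

def objective(particles, C):
    """J(rho_N) = R(mean_i F(X_i)) with the hypothetical R(m) = 0.5 * ||m - C||_F^2."""
    m = np.mean([feature(X) for X in particles], axis=0)
    return 0.5 * float(np.sum((m - C) ** 2))

def particle_gradients(particles, C):
    """Per-particle driving gradient DF(X_i)^*[grad R(m)].

    The Euclidean gradient of J with respect to X_i equals this quantity divided by N.
    """
    m = np.mean([feature(X) for X in particles], axis=0)
    G = m - C  # grad R evaluated at the averaged feature
    # Adjoint of the linear map D -> D X^T + X D^T, applied to G.
    return [(G + G.T) @ X for X in particles]

rng = np.random.default_rng(0)
N, d, k = 5, 3, 2
particles = [rng.standard_normal((d, k)) for _ in range(N)]
C = rng.standard_normal((d, d))
grads = particle_gradients(particles, C)

# Finite-difference check on particle 2 along a random direction D.
D, h = rng.standard_normal((d, k)), 1e-5
plus = [X.copy() for X in particles]
minus = [X.copy() for X in particles]
plus[2] = plus[2] + h * D
minus[2] = minus[2] - h * D
fd = (objective(plus, C) - objective(minus, C)) / (2 * h)
print("mismatch:", abs(fd - float(np.sum(grads[2] * D)) / N))
```

The factor 1/N separates the Euclidean gradient of the N-particle objective from the per-particle field that stays of order one as N grows. That rescaling is the usual convention in mean-field analyses, and it is an assumption here that the paper follows it.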

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper lifts regularized Muon to a mean-field probability setting and derives its inertial limit as damped Hamiltonian dynamics with an exact dissipation identity.

The core new piece is the lift of regularized Muon to finite-particle objectives of the form J(ρ) = R(∫ F dρ), followed by the inertial continuous-time limit and the phase-space mean-field equation on parameter-momentum pairs. The key step is recognizing the regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm, which turns the update into a mirror step with momentum as dual coordinate. From there they obtain a damped Hamiltonian probability flow and prove the Hamiltonian energy decreases monotonically by direct calculation from the structure. They also establish well-posedness of the mean-field equation and propagation of chaos for the interacting particles, plus a blockwise extension to Hilbert-valued features for mixture-of-experts models. These elements are not in the prior Muon literature cited in the abstract, so the Hamiltonian framing is genuinely new.

The dissipation identity itself does not rely on the extra assumptions listed for convergence rates. The paper is explicit that gradient-dominance, bounded-momentum, and curvature/alignment conditions are added only for the exponential rates on the objective gap; the identity holds more generally. That separation is clean. The main limitation is that the rates themselves are conditional on those assumptions, which narrows their immediate scope, but the authors flag this clearly. The formal work on the particle system and the mean-field limit looks reproducible in principle once the proofs are checked.

This is useful for people who already work on continuous-time or mean-field analyses of matrix optimizers in large models. A reader focused on optimizer theory or Wasserstein gradient flows would find the perspective worth engaging. It is worth sending to peer review because the construction is coherent and the dissipation result is a concrete addition even if the convergence claims stay conditional.

Referee Report

2 major / 0 minor

Summary. The paper develops a gradient flow on probability measures over matrix-valued parameters induced by a regularized version of the Muon optimizer. It identifies the regularized orthogonalization map as the gradient of a smooth Fenchel-dual smoothing of the nuclear norm, lifts the dynamics to finite-particle probability objectives of the form J(ρ)=R(∫F dρ), derives the inertial continuous-time limit, and passes to a phase-space mean-field equation. The resulting flow is shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential, with an exact Hamiltonian dissipation identity proving monotonic decrease of the Hamiltonian energy. Under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, exponential convergence rates are obtained for both continuous and discrete time; well-posedness of the mean-field equation and propagation of chaos are established, with an extension to Hilbert-valued feature maps for blockwise Muon flows in transformer MoE models.

Significance. If the derivations hold, the exact dissipation identity provides a structurally clean Hamiltonian perspective on inertial Muon dynamics that is stronger than generic energy dissipation in gradient flows. The mean-field lifting and propagation-of-chaos results are technically substantive for analyzing large-scale neural network training, and the blockwise extension directly targets practical MoE architectures. The work supplies a concrete bridge between mirror-descent interpretations of orthogonalization and probability-gradient-flow theory.

major comments (2)

[Abstract] Abstract (key observation paragraph): The identification of the regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm is load-bearing for the entire Hamiltonian structure. It is presented as an observation; the manuscript must supply the explicit smoothing functional, the value of the regularization parameter, and the direct verification that its gradient recovers the Muon update, to rule out the possibility that the parameter is chosen post-hoc to enforce the desired mirror-map property.
[Abstract] Abstract (convergence paragraph): The exponential convergence rates are stated to hold only under gradient-dominance, bounded-momentum, and curvature/alignment assumptions. These assumptions are not shown to be satisfied by standard neural-network losses or by the Muon dynamics itself; the manuscript should either derive verifiable conditions under which they hold or provide a counter-example illustrating when the rates fail.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address each major comment below and will incorporate revisions where appropriate to strengthen the manuscript.

Referee: [Abstract] Abstract (key observation paragraph): The identification of the regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm is load-bearing for the entire Hamiltonian structure. It is presented as an observation; the manuscript must supply the explicit smoothing functional, the value of the regularization parameter, and the direct verification that its gradient recovers the Muon update, to rule out the possibility that the parameter is chosen post-hoc to enforce the desired mirror-map property.

Authors: We agree that the explicit construction is essential for rigor. In the revised version we will state the smoothing functional explicitly as the Fenchel dual of the nuclear norm plus an ε-quadratic regularizer (with ε the regularization parameter), and include the direct gradient computation verifying that it recovers the regularized orthogonalization map used by Muon. This material will be added to the abstract and expanded in Section 2.
Revision: yes

Referee: [Abstract] Abstract (convergence paragraph): The exponential convergence rates are stated to hold only under gradient-dominance, bounded-momentum, and curvature/alignment assumptions. These assumptions are not shown to be satisfied by standard neural-network losses or by the Muon dynamics itself; the manuscript should either derive verifiable conditions under which they hold or provide a counter-example illustrating when the rates fail.

Authors: We accept that the assumptions require further discussion. The revision will add a remark deriving verifiable conditions (e.g., Polyak-Łojasiewicz inequality on the loss, which holds in certain overparameterized regimes) under which the rates apply, together with a simple counter-example where gradient dominance fails and only sublinear convergence is recovered. These additions will appear in the convergence analysis section.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via independent mathematical identification

The paper's core chain begins with the stated key observation that the regularized orthogonalization map equals the gradient of a Fenchel-dual smoothing of the nuclear norm. This is used to interpret the update as a mirror step and lift to the probability-measure objective J(ρ), yielding the inertial continuous-time limit and phase-space mean-field equation. The damped Hamiltonian structure and exact dissipation identity are then derived directly from this mirror potential. No equation is shown to equal its input by construction, no parameter is fitted then renamed as prediction, and no load-bearing premise rests on a self-citation chain. The additional convergence assumptions are isolated and do not affect the dissipation identity. The derivation therefore remains independent of its target outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the identification of the regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm (a domain assumption) and on the additional gradient-dominance/bounded-momentum/curvature assumptions required for convergence rates. The smoothing/regularization parameter functions as a free parameter whose value is not derived from first principles.

free parameters (1)

smoothing/regularization parameter
Controls the analytic smoothing of the idealized Muon orthogonalization map; its value is chosen to make the Fenchel-dual construction work.

axioms (1)

Domain assumption: The regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm.
Stated as the key observation that identifies the Muon update as a mirror/prox step.

invented entities (1)

damped Hamiltonian probability dynamics on parameter-momentum pairs (no independent evidence)
Purpose: Describes the mean-field limit of the inertial Muon particle system.
Introduced via the lifting to probability objectives; no independent falsifiable prediction outside the derivation is given.
