Muon in Associative Memory Learning: Training Dynamics and Scaling Laws

Binghui Li; Han Zhong; Kaifei Wang; Liwei Wang; Pinyan Lu

arXiv: 2602.05725 · v2 · pith:Y7FO5DPZnew · submitted 2026-02-05 · cs.LG · math.OC · stat.ML

keywords: muon · scaling · frequency · gradient · matrix · associative · case · components

Abstract

Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-law frequency spectrum, we derive Muon's scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and block-symmetric gradient structure. In contrast, the preconditioner with coordinate-wise sign operator could match Muon under oracle access to unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.
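
To make the contrast in the abstract concrete, the following is a minimal sketch of the two sign operators it mentions: the matrix sign (the polar factor U V^T of a gradient G = U S V^T) and the coordinate-wise sign used by SignGD. The function names, the example matrix, and the use of the classical cubic Newton-Schulz iteration are assumptions made for illustration; they are not taken from the paper, and the sketch assumes a full-rank input.

```python
# Illustrative sketch only; helper names and the iteration choice are assumptions.

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matrix_sign(g, steps=30):
    """Approximate the polar factor U V^T of g = U S V^T (full rank assumed).

    Uses the classical Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X,
    which drives every singular value of X toward 1.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]  # singular values now lie in (0, 1]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rx, rq)] for rx, rq in zip(x, xxtx)]
    return x


def coordinate_sign(g):
    """Entry-wise sign, as used by SignGD."""
    return [[(v > 0) - (v < 0) for v in row] for row in g]


# Hypothetical gradient with singular values 3 and 1 (symmetric positive definite).
G = [[2.0, 1.0], [1.0, 2.0]]
muon_direction = matrix_sign(G)        # close to the 2x2 identity
signgd_direction = coordinate_sign(G)  # the all-ones matrix, which has rank one
```

In this toy case the matrix sign replaces both singular values with 1, so the large and small spectral components receive equally sized updates, while the coordinate-wise sign depends on the basis in which the gradient is written. This is only meant to illustrate the kind of rebalancing the abstract describes, not the paper's analysis.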

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Phases of Muon: When Muon Eclipses SignSGD
math.OC 2026-05 unverdicted novelty 7.0

On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
cs.LG 2026-03 unverdicted novelty 7.0

Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...

Dimension-Free Saddle-Point Escape in Muon
cs.LG 2026-05 unverdicted novelty 6.0

Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.