Canonical reference

arXiv preprint arXiv:2507.11005

Chongjie Si, Debing Zhang, Wei Shen · 2025 · arXiv 2507.11005

Canonical reference. 80% of citing Pith papers cite this work as background.

17 Pith papers citing it

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

AMUSE: Anytime Muon with Stable Gradient Evaluation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18 · conditional · novelty 7.0

Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.

Accelerating LMO-Based Optimization via Implicit Gradient Transport

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationarity measure.

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

On the Convergence of Muon and Beyond

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Anytime Training with Schedule-Free Spectral Optimization

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

cs.LG · 2026-03-20 · conditional · novelty 5.0

RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

cs.LG · 2026-03-10 · unverdicted · novelty 5.0

HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

cs.LG · 2025-09-15 · unverdicted · novelty 5.0

Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.

citing papers explorer

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds cs.LG · 2026-05-07 · unverdicted · none · ref 33
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
AMUSE: Anytime Muon with Stable Gradient Evaluation cs.LG · 2026-05-21 · unverdicted · none · ref 38
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 31
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · conditional · none · ref 140
Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.
Muon is Not That Special: Random or Inverted Spectra Work Just as Well cs.LG · 2026-05-11 · unverdicted · none · ref 13
Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.
Accelerating LMO-Based Optimization via Implicit Gradient Transport cs.LG · 2026-05-07 · unverdicted · none · ref 13
LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationarity measure.
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo cs.LG · 2026-04-19 · unverdicted · none · ref 44
A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
On the Convergence of Muon and Beyond cs.LG · 2025-09-19 · unverdicted · none · ref 44
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence cs.LG · 2026-05-13 · unverdicted · none · ref 13
Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling cs.LG · 2026-05-08 · unverdicted · none · ref 15
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation cs.LG · 2026-05-08 · unverdicted · none · ref 48
PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.
Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 73
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration cs.LG · 2026-03-30 · unverdicted · none · ref 18
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 37
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization cs.LG · 2026-03-20 · conditional · none · ref 14
RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction cs.LG · 2026-03-10 · unverdicted · none · ref 25
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training cs.LG · 2025-09-15 · unverdicted · none · ref 49
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
