Muon is Not That Special: Random or Inverted Spectra Work Just as Well

2026 · cs.LG · arXiv 2605.11181

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods and by linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate that it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
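
As a rough illustration of the idea described in the abstract (keeping a gradient's singular vectors while changing its singular values), the following is a minimal NumPy sketch. It is not the authors' implementation: it uses an exact SVD rather than the QDWH-based iteration the abstract mentions, and the function name `respectralize`, the `mode` strings, the exponent `alpha`, and the uniform noise range are all hypothetical choices made for this example. In particular, how `alpha` maps onto the paper's Schatten parameter is not claimed here.

```python
import numpy as np


def respectralize(grad, mode="muon", alpha=0.0, rng=None, eps=1e-12):
    """Keep grad's singular vectors, replace its singular values.

    Hypothetical helper for illustration only (exact SVD, not QDWH).
      mode="muon":   all singular values set to 1 (orthogonalized update)
      mode="power":  singular values raised to `alpha`
                     (alpha=1 returns grad, alpha=0 matches "muon",
                      alpha<0 inverts the ordering of the spectrum)
      mode="random": singular values drawn from positive noise
                     (the Uniform(0.5, 1.5) range is an assumption)
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if mode == "muon":
        new_s = np.ones_like(s)
    elif mode == "power":
        # Clamp away from zero so negative exponents stay finite.
        new_s = np.maximum(s, eps) ** alpha
    elif mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        new_s = rng.uniform(0.5, 1.5, size=s.shape)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Scaling the columns of u by new_s is the same as u @ diag(new_s).
    return (u * new_s) @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal((4, 3))
    for mode, kwargs in [("muon", {}), ("power", {"alpha": -0.5}), ("random", {"rng": rng})]:
        update = respectralize(grad, mode, **kwargs)
        # A positive inner product with grad means -update is a descent direction.
        print(mode, float(np.sum(grad * update)) > 0.0)
```

In this sketch every variant has a positive inner product with the gradient as long as the replacement singular values are positive, since that inner product equals the sum of the original singular values weighted by the new ones. This is one simple way to see why even a random spectrum can still yield a descent direction.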

representative citing papers

Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

math.OC · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Establishes matching Ω and O(min{m,n} ε^{-(3p-2)/(p-1)}) bounds for scale-invariant spectral-norm methods under heavy-tailed noise, plus an improved O(min{m,n} ε^{-(5p-3)/(2p-2)}) rate via transported Scion under Hessian Lipschitz continuity.

Muon as a Residual Connection

cs.LG · 2026-07-01 · unverdicted · novelty 3.0

Muon is interpreted as an implicit residual connection that sacrifices local gradient fidelity to improve downstream layer usability in neural network training.

