MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang · 2026 · cs.LG · arXiv 2602.21545

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton--Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18

citing papers explorer

Showing 2 of 2 citing papers.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration cs.LG · 2026-03-30 · unverdicted · none · ref 16 · internal anchor
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unreviewed · ref 170 · internal anchor

MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer