HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
Pith reviewed 2026-05-25 06:48 UTC · model grok-4.3
The pith
HTMuon corrects Muon's update rule to allow heavier-tailed weight spectra while retaining interdependency capture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HTMuon modifies the Muon optimizer by applying a heavy-tailed spectral correction to its updates. The correction preserves Muon's handling of parameter interdependencies while producing heavier-tailed updates, which in turn induce heavier-tailed weight spectra. The paper shows that the method corresponds to steepest descent under a Schatten-q norm constraint, gives a convergence analysis for smooth non-convex settings, and reports empirical gains, including a perplexity reduction of up to 0.98 relative to Muon on LLaMA pretraining.
What carries the argument
Heavy-tailed spectral correction, which adjusts the singular values of the orthogonalized Muon update to produce a heavier-tailed spectrum.
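To make the mechanism concrete, here is a minimal NumPy sketch. It assumes the correction takes the form of raising the singular values of the update matrix to a power p in [0, 1], so that p = 0 recovers an orthogonalized, Muon-style factor U V^T. The function name `spectral_power_update`, the exponent name `p`, and the use of an exact SVD are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def spectral_power_update(m, p):
    """Return U diag(s**p) V^T for the thin SVD m = U diag(s) V^T.

    Hypothetical helper (not the authors' code): p = 0 gives the
    orthogonalized factor U V^T, p = 1 returns m itself, and
    0 < p < 1 keeps some spread in the singular values.
    """
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    # Scale column i of U by s[i]**p, then recombine with V^T.
    return (u * s**p) @ vt

rng = np.random.default_rng(0)
m = rng.standard_normal((6, 4))  # stand-in for a momentum matrix

muon_like = spectral_power_update(m, 0.0)
heavy = spectral_power_update(m, 0.5)

sv_m = np.linalg.svd(m, compute_uv=False)
sv_muon = np.linalg.svd(muon_like, compute_uv=False)
sv_heavy = np.linalg.svd(heavy, compute_uv=False)
```

With p = 0 every singular value of the output equals 1, while with p = 0.5 the output keeps the square roots of the input's singular values, so larger directions still receive larger steps.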
If this is right
- HTMuon serves as a plug-in improvement that can be combined with existing Muon variants.
- The method delivers consistent gains on LLM pretraining tasks such as LLaMA on C4 and on image classification.
- It admits an interpretation as steepest descent under the Schatten-q norm constraint.
- Convergence holds in smooth non-convex optimization settings.
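The Schatten-q reading in the list above can be checked numerically using standard norm duality. For 1 < q < ∞ with conjugate exponent q* = q/(q-1), the matrix that maximizes the inner product with G over the unit Schatten-q ball shares G's singular vectors and has singular values proportional to s^(q*-1) = s^(1/(q-1)). The sketch below verifies this; the function names are hypothetical, and whether the paper uses this exact normalization is an assumption.

```python
import numpy as np

def schatten_norm(x, q):
    """Schatten-q norm: the l_q norm of the singular values of x."""
    s = np.linalg.svd(x, compute_uv=False)
    return float(np.sum(s**q) ** (1.0 / q))

def schatten_lmo(g, q):
    """Maximizer of <g, x> over ||x||_{S_q} <= 1, for 1 < q < inf."""
    q_star = q / (q - 1.0)            # conjugate exponent, 1/q + 1/q* = 1
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    d = s ** (q_star - 1.0)           # singular values of the maximizer
    d = d / np.sum(s**q_star) ** (1.0 / q)  # scale to unit Schatten-q norm
    return (u * d) @ vt

rng = np.random.default_rng(1)
g = rng.standard_normal((5, 3))       # stand-in for a gradient matrix
q = 3.0
x_star = schatten_lmo(g, q)

attained = float(np.sum(g * x_star))          # <g, x_star>
dual_norm = schatten_norm(g, q / (q - 1.0))   # ||g|| in the dual norm

# A random feasible competitor should not beat the maximizer.
y = rng.standard_normal((5, 3))
y = y / schatten_norm(y, q)
competitor = float(np.sum(g * y))
```

As q grows, the exponent 1/(q-1) tends to 0 and the maximizer approaches the orthogonalized factor U V^T associated with the spectral norm; smaller q leaves more of the original singular-value spread in the step.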
Where Pith is reading between the lines
- The same spectral adjustment principle might be tested on other orthogonalized or momentum-based optimizers to check for similar gains.
- Explicit control of weight spectrum tail weight could become a design lever when scaling optimizers to larger models.
- Varying the q parameter in the Schatten norm offers a direct knob for tuning the degree of heavy-tailedness induced by the correction.
Load-bearing premise
That suppressing heavy-tailed weight spectra through Muon's orthogonalized rule harms performance and that the proposed correction will improve results without instability or other drawbacks.
What would settle it
An experiment on the reported LLaMA pretraining setup in which HTMuon produces no measurable increase in tail heaviness of the weight spectra or yields no perplexity reduction relative to Muon would falsify the central performance claim.
Original abstract
Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HTMuon as an enhancement to the Muon optimizer for training large language models and other neural networks. It argues that Muon's orthogonalized updates suppress heavy-tailed weight spectra, which is detrimental according to Heavy-Tailed Self-Regularization (HT-SR) theory. HTMuon applies a spectral correction to produce heavier-tailed updates and spectra while maintaining the ability to capture parameter interdependencies. The authors demonstrate empirical improvements on LLM pretraining (e.g., up to 0.98 lower perplexity on LLaMA with C4 dataset) and image classification tasks, show plug-in compatibility with other Muon variants, and provide theoretical results linking HTMuon to steepest descent under the Schatten-q norm along with convergence guarantees for smooth non-convex optimization.
Significance. If the performance improvements are robust and the theoretical equivalence holds rigorously, this work would offer a meaningful contribution to the field of optimization for deep learning by integrating insights from heavy-tailed spectral analysis into practical optimizers. It could lead to better training dynamics for large models and encourage further exploration of norm-based updates in non-convex settings. The plug-in nature and open-source implementation add to its potential impact.
major comments (2)
- [Experimental Results] Experimental Results section: The performance claims, including the reduction in perplexity by up to 0.98 compared to Muon on LLaMA/C4, are presented without standard deviations, error bars, or results across multiple random seeds. This makes it impossible to assess whether the gains are statistically reliable or reproducible, which is load-bearing for the central empirical claim of consistent improvement.
- [Theoretical Analysis] Theoretical Analysis section: The claim that HTMuon corresponds to steepest descent under the Schatten-q norm (and the associated convergence analysis) is presented as new analysis, but the derivation steps are not provided in sufficient detail to verify independence from the update rule or to rule out circularity. This equivalence is central to the paper's theoretical contribution and requires explicit expansion.
minor comments (2)
- [Abstract] The abstract refers to improvements over 'state-of-the-art baselines' without naming them explicitly beyond Muon; adding this detail would strengthen context for the reported gains.
- Notation for the spectral correction and Schatten-q norm could be introduced more clearly with a dedicated preliminary section to aid readers unfamiliar with HT-SR theory.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen both the empirical and theoretical sections.
- Referee: [Experimental Results] Experimental Results section: The performance claims, including the reduction in perplexity by up to 0.98 compared to Muon on LLaMA/C4, are presented without standard deviations, error bars, or results across multiple random seeds. This makes it impossible to assess whether the gains are statistically reliable or reproducible, which is load-bearing for the central empirical claim of consistent improvement.
  Authors: We agree that reporting variability is essential for evaluating the reliability of the reported gains. In the revised manuscript we will include results averaged over multiple random seeds together with standard deviations for the primary experiments, including the LLaMA pretraining on C4. Additional runs have been performed to obtain these statistics. revision: yes
- Referee: [Theoretical Analysis] Theoretical Analysis section: The claim that HTMuon corresponds to steepest descent under the Schatten-q norm (and the associated convergence analysis) is presented as new analysis, but the derivation steps are not provided in sufficient detail to verify independence from the update rule or to rule out circularity. This equivalence is central to the paper's theoretical contribution and requires explicit expansion.
  Authors: We acknowledge that the current presentation of the derivations is too concise. In the revision we will expand the Theoretical Analysis section with complete, step-by-step derivations of the equivalence between HTMuon and steepest descent under the Schatten-q norm, explicitly showing independence from the particular update rule and clarifying the convergence analysis to address any concerns about circular reasoning. revision: yes
Circularity Check
No significant circularity in derivation chain
The paper motivates HTMuon from the external HT-SR theory to counteract Muon's orthogonal update suppressing heavy-tailed spectra, then states that the resulting rule corresponds to steepest descent under the Schatten-q norm with a separate non-convex convergence argument. No equation or self-citation in the provided text reduces this correspondence to a redefinition of the input update rule, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. Experiments comparing against Muon and other baselines on C4 pretraining supply independent falsifiable evidence. The chain is therefore self-contained.
Forward citations
Cited by 2 Pith papers
- LionMuon: Alternating Spectral and Sign Descent for Efficient Training
  LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...
- RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
  RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
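The RMNP summary above describes replacing Newton-Schulz iteration with row-wise L2 normalization. A minimal sketch of such a normalization step follows, assuming it is applied to an m x n momentum matrix; the function name `row_normalize` and the `eps` guard against zero rows are hypothetical, and this is not the RMNP authors' implementation. It touches each of the m*n entries a constant number of times, which is where the O(mn) cost comes from.

```python
import numpy as np

def row_normalize(m, eps=1e-12):
    """Divide each row of m by its L2 norm (rows with zero norm stay zero)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)  # shape (rows, 1)
    return m / np.maximum(norms, eps)                 # eps avoids 0/0

rng = np.random.default_rng(2)
momentum = rng.standard_normal((4, 8))  # stand-in for a momentum matrix
momentum[2] = 0.0                       # include a degenerate all-zero row

update = row_normalize(momentum)
row_norms = np.linalg.norm(update, axis=1)
```

Unlike an orthogonalized update, the result generally does not have all singular values equal to 1; only the row norms are equalized.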
Reference graph
Works this paper leans on
- [1] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. arXiv preprint arXiv:2410.21265.
- [2] Da Chang, Yongxiang Liu, and Ganzhao Yuan. On the convergence of muon and beyond. arXiv preprint arXiv:2509.15816.
- [3] Lei Chen, Joan Bruna, and Alberto Bietti. Distributional associations vs in-context reasoning: A study of feed-forward and attention layers. arXiv preprint arXiv:2406.03068.
- [4] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054.
- [5] URL https://openreview.net/forum?id=ne6zeqLFCZ. Yatin Dandi, Luca Pesce, Hugo Cui, Florent Krzakala, Yue M Lu, and Bruno Loureiro. A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities. arXiv preprint arXiv:2410.18938.
- [6] Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299.
- [7] Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová, Bruno Loureiro, and Florent Krzakala. Scaling laws and spectra of shallow neural networks in the feature learning regime. arXiv preprint arXiv:2509.24882.
- [8] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [9] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.
- [10] Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235.
- [11] Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. Alphadecay: Module-wise weight decay for heavy-tailed balancing in llms. arXiv preprint arXiv:2506.14562.
- [12] Liam Hodgkinson, Zhichao Wang, and Michael W Mahoney. Models of heavy-tailed mechanistic universality. arXiv preprint arXiv:2506.03470.
- [13] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [14] Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable. arXiv preprint arXiv:2510.05491.
- [15] Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. arXiv preprint arXiv:2411.16085.
- [16] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2024a. URL https://arxiv.org/abs/2305.14342. Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei ...
- [17] URL https://arxiv.org/abs/2601.13474. Charles H Martin and Christopher Hinrichs. Setol: A semi-empirical theory of (deep) learning. arXiv preprint arXiv:2507.17912.
- [18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
- [19] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656.
- [20] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529.
- [21] Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440.
- [22] Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, and Ashish Vaswani. Prac...
- [23] Pratyusha Sharma, Jordan T Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. arXiv preprint arXiv:2312.13558.
- [24] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737.
- [25] Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005.
- [26] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.
- [27] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- [28] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321.
- [29] Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, and Lei Wu. The sharpness disparity principle in transformers for accelerating language model pre-training. arXiv preprint arXiv:2502.19002, 2025a. Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent YF Tan. Muon outperforms adam in tail-en...
- [30] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046.
- [31] Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. Mars: Unleashing the power of variance reduction for training large models. arXiv preprint arXiv:2411.10438.
- [32] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471.
- [33] Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. arXiv preprint arXiv:2406.16793.
- [34] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507.
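- [35] Consider $M_0 = G_0$, $M_t = \beta M_{t-1} + (1-\beta) G_t$. Under Assumptions 6.2 and 6.3, if we assume $\eta = \max\{\eta_t\}_{t=1}^{T}$, then under Assumption 6.4 we have $\mathbb{E}\|\nabla f(W_t) - M_t\|_F \le \sqrt{\frac{1-\beta}{1+\beta}}\,\frac{\sigma}{\sqrt{B}} + \beta^{t}\,\frac{\sigma}{\sqrt{B}} + \frac{\sqrt{r}\,l_p\,\beta L \eta}{1-\beta}$, where $G_t = \frac{1}{B}\sum_{i=1}^{B} \nabla f(W_t; \xi_{t,i})$ and $B$ is the batch size. Proof. We define $C_0 = \nabla f(W_0)$, $C_t = \beta C_{t-1} + (1-\beta)\nabla f(W_t) = (1-\beta)\sum_{i=1}^{t} \beta^{t-i}\nabla f(W_i) + \beta^{t}\nabla f(W_0)$, under ...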
- [36] and SGDM's sample complexity in Arjevani et al. (2023); Garrigos and Gower (2023). □ A.3 Lemma for PL exponent $\alpha$. Lemma A.6. Suppose the singular values of a matrix $W \in \mathbb{R}^{n \times m}$ follow $s_k = s_1 k^{-s}$, $1 \le k \le n$. Then the PL exponent $\alpha$ of $W$ satisfies $\alpha = 1 + \frac{1}{2s}$. Proof. Since $s_k = s_1 k^{-s}$, $1 \le k \le n$, the eigenvalues satisfy $\lambda_k = \lambda_1 k^{-2s}$, $1 \le k \le n$. Here we suppose $\Lambda$ is a random v...
- [37] Figure 9: Training loss curves for LLaMA-60M and LLaMA-135M, plotting training loss against steps for Muon and HTMuon: (a) LLaMA-60M on C4, (b) LLaMA-135M on C4. Learning rate for both models and optimizers is 0.03. Both curves are smoothed via a simple...
- [38] and Figure 2 in (Wen et al., 2025), after carefully tuning the hyperparameters of the baselines on LLaMA/C4, an improvement of ≥ 0.2 PPL over Muon is generally regarded as non-negligible. For example, COSMOS outperforms Muon by 0.15 PPL for LLaMA-135M in (Liu et al., 2025b) and AlphaDecay outperforms Adam by 0.11 PPL for LLaMA-1B in (He et al., 2025). There...
- [39] For training on the ImageNet-1K dataset, we set the batch size to 1024 and p = 0.03125. We run all the experiments on one NVIDIA RTX PRO 6000 Blackwell. We set the learning rate in {0.003, 0.004, 0.005} for Adam, Muon, and HTMuon. D Baseline Optimizers. In this section, we provide the algorithms for all optimizers evaluated in our study. We adopt the following nota...
- [40] AdaMuon (Si et al., 2025): An adaptive variant of Muon that incorporates second-moment information into orthogonalized updates. We put the implementations in Algorithm
- [41] MARS (Yuan et al., 2024): A momentum-based adaptive optimizer included as a representative adaptive baseline. We put the implementations in Algorithm
- [42] SOAP (Vyas et al., 2024): An optimizer that applies stochastic orthogonalization or projection to gradient updates. We put the implementations in Algorithm
- [43] Cautious (Liang et al., 2024): An optimizer that modifies update application in a conservative manner based on gradient information. We put the implementations in Algorithm
- [44] GaLore (Zhao et al., 2024): A memory-efficient optimizer that performs low-rank gradient projection to reduce optimizer-state and update costs, enabling large-model training under limited GPU memory. We put the implementations in Algorithm
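Lemma A.6, excerpted in entry [36] of the list above, relates a power-law decay of singular values to the PL exponent α = 1 + 1/(2s). The following is a small numerical sanity check under one reading of the lemma: treat the eigenvalues λ_k = λ_1 k^(-2s) as samples, note that k of the n samples are at least λ_k, and fit the tail exponent on a log-log scale. The function name and the default values of `n` and `s1` are hypothetical, and the least-squares fit is a stand-in for whatever estimator the paper uses.

```python
import numpy as np

def fitted_pl_exponent(s_exp, n=1000, s1=3.0):
    """Estimate the density exponent alpha when s_k = s1 * k**(-s_exp)."""
    k = np.arange(1, n + 1, dtype=float)
    sing = s1 * k ** (-s_exp)   # singular values s_k = s_1 k^{-s}
    eig = sing**2               # eigenvalues lambda_k = lambda_1 k^{-2s}
    # Exactly k of the n eigenvalues are >= eig[k-1], so the empirical
    # tail probability at eig[k-1] is k / n.
    slope, _ = np.polyfit(np.log(eig), np.log(k / n), 1)
    # Tail P(L >= x) ~ x**slope implies density ~ x**(slope - 1),
    # so the density exponent is alpha = 1 - slope.
    return 1.0 - slope

alpha_half = fitted_pl_exponent(0.5)   # lemma predicts 1 + 1/(2*0.5) = 2
alpha_one = fitted_pl_exponent(1.0)    # lemma predicts 1 + 1/(2*1.0) = 1.5
```

Because the synthetic spectrum is an exact power law, the fitted slope is -1/(2s) up to floating-point error, which matches the exponent stated in the lemma; slower singular-value decay (smaller s) corresponds to a larger α under this reading.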