
arxiv: 2603.10067 · v2 · pith:TKS7HMXInew · submitted 2026-03-10 · 💻 cs.LG · cs.AI

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

Pith reviewed 2026-05-25 06:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Muon optimizer · heavy-tailed spectra · spectral correction · LLM pretraining · Schatten norm · optimizer improvement · convergence analysis

The pith

HTMuon corrects Muon's update rule to allow heavier-tailed weight spectra while retaining interdependency capture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the orthogonalized updates in Muon suppress the emergence of heavy-tailed weight spectra, which limits performance according to heavy-tailed self-regularization ideas. HTMuon adds a spectral correction step that produces heavier-tailed updates and weight spectra without losing the ability to model parameter interdependencies. Experiments on language model pretraining and image classification demonstrate consistent gains, including lower perplexity on LLaMA models trained on C4. A reader would care because the change is presented as a simple, compatible modification that could make large-scale training more effective.

Core claim

HTMuon modifies the Muon optimizer through a heavy-tailed spectral correction applied to its updates. This preserves Muon's handling of parameter interdependencies but generates heavier-tailed updates that induce heavier-tailed weight spectra. The method is shown to correspond to steepest descent under a Schatten-q norm constraint, with accompanying convergence analysis in smooth non-convex settings, and yields empirical improvements such as up to 0.98 lower perplexity versus Muon on LLaMA pretraining.

What carries the argument

Heavy-tailed spectral correction, which adjusts the singular values of the orthogonalized Muon update to produce a heavier-tailed spectrum.
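
A minimal sketch of what such a correction could look like, assuming it raises each singular value of the momentum matrix to a power p before recombining the singular vectors. The power form is an assumption inferred from the Figure 8 caption (p = 0 reduces to Muon, p = 0.125 is reported as a strong choice). The function name `ht_spectral_update` and the use of an exact SVD are illustrative choices, not taken from the paper's implementation.

```python
import numpy as np

def ht_spectral_update(momentum: np.ndarray, p: float = 0.125) -> np.ndarray:
    """Return U diag(s**p) V^T for the thin SVD momentum = U diag(s) V^T.

    Hypothetical helper (assumed power-form correction): p = 0 yields the
    orthogonalized update U V^T with all singular values equal to one;
    p > 0 keeps part of the original spectral decay.
    """
    u, s, vt = np.linalg.svd(momentum, full_matrices=False)
    # Broadcasting scales column j of u by s[j]**p.
    return (u * s**p) @ vt

rng = np.random.default_rng(0)
m = rng.standard_normal((6, 4))
s_in = np.linalg.svd(m, compute_uv=False)
s_muon = np.linalg.svd(ht_spectral_update(m, p=0.0), compute_uv=False)
s_ht = np.linalg.svd(ht_spectral_update(m, p=0.125), compute_uv=False)
```

Under this assumption, `s_muon` is the flat spectrum of an orthogonalized update, while `s_ht` equals `s_in ** 0.125`, so larger singular directions of the momentum receive proportionally larger steps.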

If this is right

  • HTMuon serves as a plug-in improvement that can be combined with existing Muon variants.
  • The method delivers consistent gains on LLM pretraining tasks such as LLaMA on C4 and on image classification.
  • It admits an interpretation as steepest descent under the Schatten-q norm constraint.
  • Convergence holds in smooth non-convex optimization settings.
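
The steepest-descent reading can be checked numerically. The sketch below assumes the standard Hölder-duality result for Schatten norms: for a gradient G = U diag(σ) Vᵀ, the matrix in the unit Schatten-q ball that maximizes ⟨G, X⟩ is U diag(σ^(q*−1)) Vᵀ rescaled to unit norm, where 1/q + 1/q* = 1. Under this reading the exponent on the singular values is 1/(q−1), so q → ∞ recovers the all-ones spectrum of orthogonalization. Whether this matches the paper's exact parameterization is an assumption, and the function names are hypothetical.

```python
import numpy as np

def schatten_norm(x: np.ndarray, q: float) -> float:
    """Schatten-q norm: the l_q norm of the singular values."""
    s = np.linalg.svd(x, compute_uv=False)
    return float(np.sum(s**q) ** (1.0 / q))

def steepest_direction(grad: np.ndarray, q: float) -> np.ndarray:
    """Maximizer of <grad, X> over the unit Schatten-q ball, for 1 < q < inf."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    power = 1.0 / (q - 1.0)  # equals q* - 1, with 1/q + 1/q* = 1
    x = (u * s**power) @ vt
    return x / schatten_norm(x, q)

rng = np.random.default_rng(1)
g = rng.standard_normal((5, 3))
q = 9.0                      # exponent 1/(q-1) = 0.125 under the assumed mapping
q_dual = q / (q - 1.0)
x = steepest_direction(g, q)
inner = float(np.sum(g * x))  # Frobenius inner product <g, x>
```

If the assumption holds, `x` has unit Schatten-q norm and `inner` equals the dual norm `schatten_norm(g, q_dual)`, which is the largest value any feasible direction can attain.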

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral adjustment principle might be tested on other orthogonalized or momentum-based optimizers to check for similar gains.
  • Explicit control of weight spectrum tail weight could become a design lever when scaling optimizers to larger models.
  • Varying the q parameter in the Schatten norm offers a direct knob for tuning the degree of heavy-tailedness induced by the correction.

Load-bearing premise

That suppressing heavy-tailed weight spectra through Muon's orthogonalized rule harms performance and that the proposed correction will improve results without instability or other drawbacks.

What would settle it

An experiment on the reported LLaMA pretraining setup in which HTMuon produces no measurable increase in tail heaviness of the weight spectra or yields no perplexity reduction relative to Muon would falsify the central performance claim.
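
One way to quantify tail heaviness for such a test is a power-law exponent fitted to the eigenvalues of WᵀW. The sketch below uses a Hill estimator on the largest eigenvalues. This is a common generic choice and an assumption here, since the paper's own fitting procedure for PL α may differ. The function name `hill_alpha` and the `tail_fraction` parameter are hypothetical.

```python
import numpy as np

def hill_alpha(weight: np.ndarray, tail_fraction: float = 0.5) -> float:
    """Estimate the power-law density exponent alpha of the ESD of W^T W.

    Smaller alpha indicates a heavier tail.
    """
    # Squared singular values are the eigenvalues of W^T W, in descending order.
    eigs = np.linalg.svd(weight, compute_uv=False) ** 2
    k = max(2, int(tail_fraction * eigs.size))
    k = min(k, eigs.size - 1)
    tail, threshold = eigs[:k], eigs[k]
    # Hill estimate of the survival-function exponent; the density exponent adds one.
    tail_index = 1.0 / np.mean(np.log(tail / threshold))
    return 1.0 + float(tail_index)

# Synthetic check: singular values s_k = k^(-1/2) give eigenvalues 1/k,
# a rank-size law whose density exponent is close to 2.
n = 1000
w = np.diag(np.arange(1, n + 1, dtype=float) ** -0.5)
alpha_hat = hill_alpha(w)
```

Comparing `hill_alpha` layer by layer for checkpoints trained with Muon and with HTMuon would give a direct, if rough, reading of whether the correction lowers α as claimed.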

Figures

Figures reproduced from arXiv: 2603.10067 by Lei Hsiung, Shenyang Deng, Shuhua Yu, Tianyu Pang, Yaoqing Yang, Yujie Fang, Zihang Liu.

Figure 1. Muon NS vs. Muon SVD on the C4 dataset. (a) Validation perplexity for LLaMA-60M/135M: Muon NS consistently achieves lower perplexity than Muon SVD. The learning rate for both methods is 0.03 for 60M and 0.02 for 135M. (b)(c) Spectra of update matrices at steps 1/9000/19000, shown in different colors: Muon SVD enforces an exactly "all-ones" spectrum in the update matrices; Muon NS stays close to one but retains noti…
Figure 2. (a) Average PL $\bar{\alpha}$ of weight ESDs for LLaMA-60M and LLaMA-135M trained on C4 with Muon and COSMOS. Muon yields a higher mean $\bar{\alpha}$, indicating less heavy-tailed spectra than COSMOS. (b) COSMOS outperforms Muon for LLaMA-60M and LLaMA-135M on the C4 dataset.
Figure 3. Comparison with Muon variant optimizers on LLaMA-60M and LLaMA-135M on the C4 dataset. All optimizers are carefully tuned via grid search; detailed results and hyperparameter settings are provided in …
Figure 4. Comparison with state-of-the-art pretraining optimizers on LLaMA-60M and LLaMA-135M on the C4 dataset.
Figure 5. We evaluate applying HTMuon and HTMuon NS on LLaMA-60M and LLaMA-135M every 1, 5, 10, and 25 steps, and report the average per-step runtime overhead for all methods. Detailed results and hyperparameter settings are provided in …
Figure 6. Layer-wise PL α for LLaMA and ResNet model weights trained with Muon and HTMuon. All models used for visualization are trained with each optimizer's best-performing hyperparameter configuration; see Appendix C for the configurations.
Figure 7. Layer-wise spectral norm and Frobenius norm for LLaMA model weights trained with …
Figure 8. Grid search over p. We find that p = 0.125 is a strong choice; note that p = 0 reduces to Muon. Detailed results are provided in …
Figure 9. Training loss curves for LLaMA-60M and LLaMA-135M. The learning rate for both models and optimizers is 0.03.
Figure 10. We evaluate applying HTMuon + NorMuon on LLaMA-60M and LLaMA-135M every 1, 5, 10, and 25 steps (with NorMuon applied on the other steps), and report the average per-step runtime overhead for all methods. Detailed results and hyperparameter settings are provided in …
Figure 11. (a)(b) Layer-wise PL α for ResNet-18 weight ESDs on CIFAR-100 and CIFAR-10, trained with Muon and HTMuon. (c)(d) Layer-wise PL α for LLaMA weight ESDs, trained with Muon and HTMuon HT. All models used for visualization are trained with each optimizer's best-performing hyperparameter configuration; see Appendix C for the configurations.
Figure 12. We conduct learning-rate grid searches for …
Original abstract

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HTMuon as an enhancement to the Muon optimizer for training large language models and other neural networks. It argues that Muon's orthogonalized updates suppress heavy-tailed weight spectra, which is detrimental according to Heavy-Tailed Self-Regularization (HT-SR) theory. HTMuon applies a spectral correction to produce heavier-tailed updates and spectra while maintaining the ability to capture parameter interdependencies. The authors demonstrate empirical improvements on LLM pretraining (e.g., up to 0.98 lower perplexity on LLaMA with C4 dataset) and image classification tasks, show plug-in compatibility with other Muon variants, and provide theoretical results linking HTMuon to steepest descent under the Schatten-q norm along with convergence guarantees for smooth non-convex optimization.

Significance. If the performance improvements are robust and the theoretical equivalence holds rigorously, this work would offer a meaningful contribution to the field of optimization for deep learning by integrating insights from heavy-tailed spectral analysis into practical optimizers. It could lead to better training dynamics for large models and encourage further exploration of norm-based updates in non-convex settings. The plug-in nature and open-source implementation add to its potential impact.

major comments (2)
  1. [Experimental Results] Experimental Results section: The performance claims, including the reduction in perplexity by up to 0.98 compared to Muon on LLaMA/C4, are presented without standard deviations, error bars, or results across multiple random seeds. This makes it impossible to assess whether the gains are statistically reliable or reproducible, which is load-bearing for the central empirical claim of consistent improvement.
  2. [Theoretical Analysis] Theoretical Analysis section: The claim that HTMuon corresponds to steepest descent under the Schatten-q norm (and the associated convergence analysis) is presented as new analysis, but the derivation steps are not provided in sufficient detail to verify independence from the update rule or to rule out circularity. This equivalence is central to the paper's theoretical contribution and requires explicit expansion.
minor comments (2)
  1. [Abstract] The abstract refers to improvements over 'state-of-the-art baselines' without naming them explicitly beyond Muon; adding this detail would strengthen context for the reported gains.
  2. Notation for the spectral correction and Schatten-q norm could be introduced more clearly with a dedicated preliminary section to aid readers unfamiliar with HT-SR theory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen both the empirical and theoretical sections.

  1. Referee: [Experimental Results] Experimental Results section: The performance claims, including the reduction in perplexity by up to 0.98 compared to Muon on LLaMA/C4, are presented without standard deviations, error bars, or results across multiple random seeds. This makes it impossible to assess whether the gains are statistically reliable or reproducible, which is load-bearing for the central empirical claim of consistent improvement.

    Authors: We agree that reporting variability is essential for evaluating the reliability of the reported gains. In the revised manuscript we will include results averaged over multiple random seeds together with standard deviations for the primary experiments, including the LLaMA pretraining on C4. Additional runs have been performed to obtain these statistics. revision: yes

  2. Referee: [Theoretical Analysis] Theoretical Analysis section: The claim that HTMuon corresponds to steepest descent under the Schatten-q norm (and the associated convergence analysis) is presented as new analysis, but the derivation steps are not provided in sufficient detail to verify independence from the update rule or to rule out circularity. This equivalence is central to the paper's theoretical contribution and requires explicit expansion.

    Authors: We acknowledge that the current presentation of the derivations is too concise. In the revision we will expand the Theoretical Analysis section with complete, step-by-step derivations of the equivalence between HTMuon and steepest descent under the Schatten-q norm, explicitly showing independence from the particular update rule and clarifying the convergence analysis to address any concerns about circular reasoning. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper motivates HTMuon from the external HT-SR theory to counteract Muon's orthogonal update suppressing heavy-tailed spectra, then states that the resulting rule corresponds to steepest descent under the Schatten-q norm with a separate non-convex convergence argument. No equation or self-citation in the provided text reduces this correspondence to a redefinition of the input update rule, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. Experiments comparing against Muon and other baselines on C4 pretraining supply independent falsifiable evidence. The chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work relies on the prior HT-SR theory and the existing Muon formulation.

pith-pipeline@v0.9.0 · 5744 in / 1175 out tokens · 36749 ms · 2026-05-25T06:48:15.249600+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LionMuon: Alternating Spectral and Sign Descent for Efficient Training

    cs.LG 2026-05 unverdicted novelty 6.0

    LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...

  2. RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

    cs.LG 2026-03 conditional novelty 5.0

    RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
