Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

Chuan He; Zhanwang Deng; Zhaosong Lu

arxiv: 2509.11983 · v2 · submitted 2025-09-15 · cs.LG · math.OC


Pith reviewed 2026-05-18 15:43 UTC · model grok-4.3


keywords: low-rank orthogonalization, Muon optimizer, foundation model training, matrix optimization, neural network training, stochastic optimization, gradient descent, iteration complexity


The pith

Low-rank orthogonalization exploits the low-rank structure of gradients to create a Muon variant that outperforms the original on large foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes low-rank orthogonalization, which performs matrix orthogonalization by leveraging the low-rank nature of gradients during neural network training. This forms the basis for low-rank matrix-signed gradient descent and a low-rank variant of Muon. Experiments on GPT-2 and LLaMA pretraining demonstrate that low-rank Muon surpasses carefully tuned vanilla Muon, particularly on tasks with large model sizes. The work also derives iteration complexity bounds for low-rank MSGD to reach approximate stationary solutions and for low-rank Muon under heavy-tailed stochastic noise.

Core claim

By replacing full orthogonalization with a low-rank version that uses the low-rank structure of gradients, low-rank MSGD achieves provable iteration complexity for approximate stationary points while low-rank Muon does the same for approximate stochastic stationary points under heavy-tailed noise. Numerical results show this yields superior performance over vanilla Muon in GPT-2 and LLaMA pretraining at large scales.

What carries the argument

low-rank orthogonalization, the process of orthogonalizing gradient matrices after projecting them to a low-rank subspace to exploit their observed low-rank structure during training
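
A minimal NumPy sketch of this idea follows. It assumes a Gaussian sketch followed by QR (both are named in the Figure 2 caption) and then an exact SVD of the small projected matrix. The function names `msign` and `low_rank_orthogonalize`, the rank parameter `r`, and the use of an exact SVD on the projected matrix are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np


def msign(B):
    """Orthogonal (polar) factor U V^T taken from a thin SVD of B."""
    U, _, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ Vt


def low_rank_orthogonalize(G, r, rng):
    """Approximate the orthogonal factor of G from a rank-r Gaussian sketch."""
    m, n = G.shape
    Omega = rng.standard_normal((n, r))  # Gaussian test matrix (assumed choice)
    Q, _ = np.linalg.qr(G @ Omega)       # m x r orthonormal basis of the sketched range
    B = Q.T @ G                          # r x n projected gradient
    return Q @ msign(B)                  # lift the small orthogonal factor back to m x n


rng = np.random.default_rng(0)
# Synthetic "gradient" that is exactly rank 4, so the sketch captures its full range.
G = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
O = low_rank_orthogonalize(G, r=4, rng=rng)
singular_values = np.linalg.svd(O, compute_uv=False)
```

On an input that is exactly rank `r`, the result coincides with the rank-`r` orthogonal factor of `G`. For a full-rank gradient it is only an approximation, and its quality depends on how quickly the singular values decay.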

If this is right

Low-rank Muon surpasses vanilla Muon on GPT-2 and LLaMA pretraining tasks with large model sizes.
Low-rank MSGD reaches an approximate stationary solution with established iteration complexity.
Low-rank Muon reaches an approximate stochastic stationary solution under heavy-tailed noise with established iteration complexity.
The low-rank approach applies directly to other large-scale matrix optimization problems arising in neural network training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the low-rank gradient property persists at even larger scales, the method could cut memory and compute costs for orthogonalization steps in models beyond current LLaMA sizes.
Similar low-rank projections might improve other matrix-aware optimizers that rely on orthogonalization or signed updates.
The assumption could be validated by tracking the effective rank of gradients across different model architectures and datasets to identify when the speedup is reliable.
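
The check in the last item could be sketched as follows. This is a hedged illustration: the helper names `effective_rank` and `topk_frobenius_fraction` and the default `rel_tol` are hypothetical, and "fraction of Frobenius norm" is taken here to mean the ratio of norms rather than of squared norms.

```python
import numpy as np


def effective_rank(G, rel_tol=1e-3):
    """Count singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(G, compute_uv=False)
    if s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


def topk_frobenius_fraction(G, k):
    """Return ||G_k||_F / ||G||_F, where G_k is the best rank-k approximation of G."""
    s = np.linalg.svd(G, compute_uv=False)
    total = np.sqrt(np.sum(s ** 2))
    if total == 0.0:
        return 1.0
    return float(np.sqrt(np.sum(s[:k] ** 2)) / total)


# Toy matrix with singular values 4, 3, 0.
G = np.diag([4.0, 3.0, 0.0])
rank_estimate = effective_rank(G)
top1_fraction = topk_frobenius_fraction(G, 1)
top2_fraction = topk_frobenius_fraction(G, 2)
```

In a training run, the same two functions would be applied to each layer's gradient (or momentum) matrix at selected steps and logged against the rank used by the optimizer.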

Load-bearing premise

Gradients during neural network training have enough low-rank structure that replacing full orthogonalization with a low-rank version preserves convergence and model quality.

What would settle it

A controlled experiment on a large LLaMA pretraining run where low-rank Muon converges to a strictly worse loss or requires substantially more steps than vanilla Muon would falsify the practical claim.

Figures

Figures reproduced from arXiv: 2509.11983 by Chuan He, Zhanwang Deng, Zhaosong Lu.

**Figure 2.** Time distribution for low-rank orthogonalization with Gaussian sketching, including the QR ...

**Figure 3.** Comparison of variance levels in matrix sign estimation across Newton-Schulz iterations (NS), ...

**Figure 4.** Objective values (top row) and gradient Frobenius norms (bottom row) for all methods during ...

**Figure 5.** Singular value distributions of the Q, K, and V matrices: across layers at iteration 300 (top ...

**Figure 6.** Comparison of validation perplexity versus computational time for all competing methods in ...

**Figure 7.** Singular value distributions of the Q, K, and V matrices: across layers at iteration 300 (top ...

**Figure 8.** Comparison of validation perplexity and computational time for all competing methods in ...

Original abstract

Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon (Jordan et al.), which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose *low-rank orthogonalization*, which performs orthogonalization by leveraging the low-rank nature of gradients during NN training. Building on this, we introduce low-rank matrix-signed gradient descent (MSGD) and a low-rank variant of Muon. Numerical experiments demonstrate the superior performance of low-rank orthogonalization, with low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining, surpassing the carefully tuned vanilla Muon on tasks with large model sizes. Theoretically, we establish the iteration complexity of low-rank MSGD for finding an approximate stationary solution, and the iteration complexity of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise. The code to reproduce our numerical experiments is available at https://github.com/dengzhanwang/Low-rank-Muon.
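
For orientation, the orthogonalization the abstract refers to is usually written in terms of the singular value decomposition. The notation below ($G$, $U$, $\Sigma$, $V$, $\operatorname{msign}$, $\eta_k$, $Q$) is a common convention chosen here for illustration and is not quoted from the paper; in particular, the step-size rule and the way the basis $Q$ is constructed are left unspecified.

```latex
\[
G = U \Sigma V^{\top} \ \text{(compact SVD)}
\quad\Longrightarrow\quad
\operatorname{msign}(G) = U V^{\top},
\qquad
X_{k+1} = X_k - \eta_k \,\operatorname{msign}\bigl(\nabla f(X_k)\bigr).
\]
\[
Q^{\top} Q = I
\quad\Longrightarrow\quad
\operatorname{msign}\bigl(Q Q^{\top} G\bigr) = Q \,\operatorname{msign}\bigl(Q^{\top} G\bigr).
\]
```

The second identity suggests why a low-rank variant can be cheap: once an $m \times r$ orthonormal basis $Q$ is available, only the $r \times n$ matrix $Q^{\top} G$ needs to be orthogonalized.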

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Low-rank Muon is a direct extension of the existing Muon optimizer that reports gains on large models, but the low-rank assumption stays unverified without any gradient rank diagnostics.


The main thing to know is that this paper replaces full orthogonalization in Muon with a low-rank version and reports that the resulting low-rank Muon beats a carefully tuned vanilla Muon on GPT-2 and LLaMA pretraining runs, especially at larger model sizes. The authors also introduce low-rank MSGD and give iteration complexity bounds for the deterministic case and for the stochastic case under heavy-tailed noise. The code release is useful for anyone who wants to test it directly.

What is actually new is the low-rank orthogonalization operator and its application to produce the low-rank Muon variant; this is not just a restatement of the prior Muon work. The empirical results on real foundation model pretraining are the clearest positive here, since they show concrete performance differences on tasks that matter. The theory appears to rest on standard arguments rather than anything circular or self-referential.

The soft spot is exactly what the stress-test note flags. The motivation and design both rely on gradients having exploitable low-rank structure during training, yet the paper does not report any direct measurements such as effective numerical rank, singular-value decay, or the fraction of Frobenius norm captured by the top-k components at training steps. Without those diagnostics, it is difficult to rule out that the observed wins come from incidental changes in effective step size or hyper-parameter effects instead of the low-rank mechanism. If those checks exist in the full manuscript they would strengthen the case; from the description they are absent.

This work is for researchers who follow matrix-aware optimizers and large-scale training methods. A reader already interested in Muon or related approaches would get value from the new procedure and the reported scaling behavior. It deserves a serious referee because it contains a concrete algorithmic addition, public code, and empirical results on relevant models, even though the central assumption could use tighter verification.

Referee Report

2 major / 2 minor

Summary. The paper proposes low-rank orthogonalization to exploit the low-rank structure of gradients in neural network training. It introduces low-rank matrix-signed gradient descent (MSGD) and a low-rank variant of the Muon optimizer, reports empirical results showing low-rank Muon outperforming carefully tuned vanilla Muon in GPT-2 and LLaMA pretraining on large models, and derives iteration complexity bounds for low-rank MSGD (approximate stationary points) and low-rank Muon (approximate stochastic stationary points under heavy-tailed noise). Code is provided for reproducibility.

Significance. If the low-rank gradient structure is verified and the performance gains hold under rigorous controls, the work could improve computational efficiency of orthogonalization steps in large-scale foundation model training without sacrificing convergence quality. The availability of reproduction code and the extension of Muon to low-rank settings are positive contributions to the matrix-optimization literature for deep learning.

major comments (2)

[Numerical Experiments] The central empirical claim (low-rank Muon surpassing tuned vanilla Muon on large-model tasks) and the theoretical iteration bounds both rest on the premise that gradients possess exploitable low-rank structure. However, the manuscript reports neither the effective numerical rank nor the fraction of Frobenius norm captured by the top-k singular vectors at any training step in the GPT-2 or LLaMA experiments. Without these diagnostics, observed gains could arise from incidental implementation differences or hyper-parameter effects rather than the low-rank mechanism.
[Theoretical Analysis] §4 (theoretical analysis): the iteration complexity statements for low-rank MSGD and low-rank Muon appear to follow from standard stochastic optimization arguments applied to a truncated orthogonalization operator. The derivation does not explicitly quantify how the rank truncation threshold affects the constants in the complexity bounds or the bias introduced relative to full orthogonalization.

minor comments (2)

[Introduction] The abstract and introduction use the phrase 'low-rank nature of gradients' without a precise definition or reference to prior work quantifying this property in transformer training.
[Numerical Experiments] Figure captions and experimental tables should explicitly state the rank truncation threshold used for each model size and whether it was tuned or fixed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the empirical validation of the low-rank structure and the transparency of the theoretical analysis. We address each major comment below and commit to revisions that directly incorporate the suggested improvements.


Referee: [Numerical Experiments] The central empirical claim (low-rank Muon surpassing tuned vanilla Muon on large-model tasks) and the theoretical iteration bounds both rest on the premise that gradients possess exploitable low-rank structure. However, the manuscript reports neither the effective numerical rank nor the fraction of Frobenius norm captured by the top-k singular vectors at any training step in the GPT-2 or LLaMA experiments. Without these diagnostics, observed gains could arise from incidental implementation differences or hyper-parameter effects rather than the low-rank mechanism.

Authors: We agree that providing these diagnostics is important to substantiate the low-rank premise. In the revised version, we will add new figures and tables in the experimental section that report, for multiple training steps in both the GPT-2 and LLaMA runs, the effective numerical rank (defined via singular values exceeding a small threshold relative to the largest) and the cumulative fraction of Frobenius norm captured by the top-k singular vectors. These additions will be computed from the existing experimental code and will directly address whether the observed gains align with the low-rank gradient structure.
revision: yes

Referee: [Theoretical Analysis] §4 (theoretical analysis): the iteration complexity statements for low-rank MSGD and low-rank Muon appear to follow from standard stochastic optimization arguments applied to a truncated orthogonalization operator. The derivation does not explicitly quantify how the rank truncation threshold affects the constants in the complexity bounds or the bias introduced relative to full orthogonalization.

Authors: We appreciate this point on the need for greater explicitness. Although the current proofs apply standard stochastic optimization techniques to the low-rank truncated operator and control the truncation error via the assumed low-rank gradient structure, we will revise §4 to include an additional lemma or remark that explicitly bounds the dependence of the complexity constants on the rank threshold k. This will also quantify the bias relative to full orthogonalization in terms of the tail singular values, under the paper's low-rank gradient assumption, thereby clarifying the relationship between the low-rank and full-rank cases.
revision: yes
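
To make the quantity in the second response concrete, consider the idealized case of exact rank-$k$ truncation. This is an illustrative calculation under that assumption, with notation ($\sigma_i$, $u_i$, $v_i$, $\rho$, $G_k$) introduced here; the paper's sketched operator and its actual bounds may differ.

```latex
\[
G = \sum_{i=1}^{\rho} \sigma_i u_i v_i^{\top}, \quad \sigma_1 \ge \dots \ge \sigma_\rho > 0,
\qquad
G_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{\top}, \quad k < \rho, \ \sigma_k > \sigma_{k+1}.
\]
\[
\operatorname{msign}(G) - \operatorname{msign}(G_k) = \sum_{i=k+1}^{\rho} u_i v_i^{\top},
\qquad
\bigl\langle G, \operatorname{msign}(G_k) \bigr\rangle
= \sum_{i=1}^{k} \sigma_i
= \|G\|_{*} - \sum_{i=k+1}^{\rho} \sigma_i .
\]
```

In this idealized setting, the alignment between the gradient and the update direction falls short of the full nuclear norm $\|G\|_{*}$ by exactly the tail sum, which is small when the singular values decay quickly.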

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained


The paper motivates low-rank orthogonalization from the empirical observation that gradients in NN training exhibit low-rank structure, then defines low-rank MSGD and low-rank Muon accordingly. Iteration complexity results for approximate stationary points are obtained by applying standard stochastic optimization arguments to these modified updates under heavy-tailed noise, without reducing to fitted parameters, self-definitional loops, or load-bearing self-citations. The central empirical claims rest on numerical experiments with GPT-2 and LLaMA rather than tautological renaming or imported uniqueness theorems. No step equates a prediction to its input by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claims rest on the empirical observation that gradients are low-rank during training and on standard assumptions from stochastic optimization theory; no new particles or forces are postulated.

free parameters (1)

rank truncation threshold
The specific rank used for the low-rank approximation is chosen based on observed gradient structure and is not derived from first principles.

axioms (2)

domain assumption: Gradients encountered during neural network training admit a useful low-rank approximation.
This premise is invoked to motivate replacing full orthogonalization with the low-rank variant.
standard math: Standard stochastic gradient assumptions hold under heavy-tailed noise.
Used to establish the iteration complexity of low-rank Muon.
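
For reference, heavy-tailed noise is commonly formalized through a bounded $\alpha$-th central moment, as written below. This is the form typical in the literature and is stated here as an assumption; the symbols ($g$, $\xi$, $\sigma$, $\alpha$) and the choice of norm are illustrative, and the paper's exact condition may differ.

```latex
\[
\mathbb{E}_{\xi}\bigl[g(X;\xi)\bigr] = \nabla f(X),
\qquad
\mathbb{E}_{\xi}\bigl[\|g(X;\xi) - \nabla f(X)\|^{\alpha}\bigr] \le \sigma^{\alpha},
\qquad \alpha \in (1, 2].
\]
```

Setting $\alpha = 2$ recovers the usual bounded-variance assumption, while $\alpha < 2$ allows the stochastic gradient to have infinite variance.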

invented entities (1)

low-rank orthogonalization operator (no independent evidence)
purpose: Efficiently approximate matrix orthogonalization by operating only on dominant singular directions.
New algorithmic primitive introduced in the paper.

pith-pipeline@v0.9.0 · 5745 in / 1454 out tokens · 35522 ms · 2026-05-18T15:43:52.179357+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


we establish the iteration complexity of low-rank MSGD … under heavy-tailed noise

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG · 2026-05 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
math.OC · 2026-05 · unverdicted · novelty 7.0

Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
math.OC · 2026-05 · unverdicted · novelty 6.0

Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
cs.LG · 2026-05 · unverdicted · novelty 5.0

MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered
cs.LG · 2026-05 · unverdicted · novelty 5.0

Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 5 Pith papers · 9 internal anchors

[2] K. Ahn, B. Xu, N. Abreu, and J. Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025.
[3] E. AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, A. Tanwer, D. J. Shah, K. Nguyen, K. Smith, M. Callahan, M. Pust, M. Parmar, P. Rushton, P. Mazarakis, R. Kapila, S. Srivastava, S. Singla, T. Romanski, Y. Vanjani, and A. Vaswani. Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222, 2025.
[4] K. An, Y. Liu, R. Pan, S. Ma, D. Goldfarb, and T. Zhang. ASGO: Adaptive structured gradient optimization. arXiv preprint arXiv:2503.20762, 2025.
[5] R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020.
[6] J. Bernstein and L. Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
[7] F. Bian, J.-F. Cai, and R. Zhang. A preconditioned Riemannian gradient descent algorithm for low-rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 45(4):2075–2103, 2024.
[8] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go... On the opportunities and risks of foundation models. arXiv, 2021.
[9] L. Bottou. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics, pages 177–186. Springer, 2010.
[10] D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 111–119, 2015.
[11] D. Carlson, Y.-P. Hsieh, E. Collins, L. Carin, and V. Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2015.
[12] D. E. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. In Advances in Neural Information Processing Systems, volume 28, 2015.
[13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1):71–120, 2020.
[14] F. L. Cesista. Muon and a selective survey on steepest descent in Riemannian and non-Riemannian manifolds, 2025.
[15] L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025.
[16] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006.
[17] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.
[18] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
[19] S. S. S. Duvvuri, F. Devvrit, R. Anil, C. Hsieh, and I. S. Dhillon. Combining axes preconditioners through Kronecker approximation for deep learning. In International Conference on Learning Representations, 2024.
[20] A. Glentis, J. Li, A. Han, and M. Hong. A minimalist optimizer design for LLM pretraining. arXiv preprint arXiv:2506.16659, 2025.
[21] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850, 2018.
[22] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[23] Y. Hao, Y. Cao, and L. Mou. Flora: Low-rank adapters are secretly gradient compressors. In International Conference on Machine Learning, 2024.
[24] C. He, Z. Lu, D. Sun, and Z. Deng. Complexity of normalized stochastic first-order methods with momentum under heavy-tailed noise. arXiv preprint arXiv:2506.11214, 2025.
[25] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. 14(8):2, 2012.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations, 1(2):3, 2022.
[27] K. Jordan, J. Bernstein, B. Rappazzo, @fernbear.bsky.social, B. Vlado, Y. Jiacheng, F. Cesista, B. Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024.
[28] K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon.
[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[30] D. Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025.
[31] T. T.-K. Lau, Q. Long, and W. Su. PolarGrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective. arXiv preprint arXiv:2505.21799, 2025.
[32] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[33] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[34] J. Li and M. Hong. A note on the convergence of Muon and further. arXiv e-prints, pages arXiv–2502, 2025.
[35] J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen, H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, X. Zhou, and Z. Yang. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
[36] L. Liu, Z. Xu, Z. Zhang, H. Kang, Z. Li, C. Liang, W. Chen, and T. Zhao. COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs. arXiv preprint arXiv:2502.17410, 2025.
[37] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[38] C. Ma, W. Gong, M. Scetbon, and E. Meeds. SWAN: Preprocessing SGD enables Adam-level performance on LLM training with significant memory reduction. arXiv e-prints, pages arXiv–2412, 2024.
[39] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Processing Systems, volume 36, pages 53038–53075, 2023.
[40] P.-G. Martinsson. Randomized methods for matrix computations. arXiv preprint arXiv:1607.01649, 2016.
[41] T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv preprint arXiv:2502.07529, 2025.
[42] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
[43] A. Riabinin, E. Shulgin, K. Gruntkowska, and P. Richtárik. Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025.
[44] V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2010.
[45] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[46] N. Sato, H. Naganuma, and H. Iiduka. Analysis of Muon’s convergence and critical batch size. arXiv preprint arXiv:2507.01598, 2025.
[47] M.-E. Sfyraki and J.-K. Wang. Lions and Muons: Optimization via stochastic Frank-Wolfe. arXiv preprint arXiv:2506.04192, 2025.
[48] W. Shen, R. Huang, M. Huang, C. Shen, and J. Zhang. On the convergence analysis of Muon. arXiv preprint arXiv:2505.23737, 2025.
[49] C. Si, D. Zhang, and W. Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025.
[50] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[51] M. Tuddenham, A. Prügel-Bennett, and J. Hare. Orthogonalising gradients to speed up neural network optimisation. arXiv preprint arXiv:2202.07052, 2022.
[52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[53] N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam for language modeling. In International Conference on Learning Representations, 2025.
[54] S. Xie, T. Wang, S. J. Reddi, S. Kumar, and Z. Li. Structured preconditioners in adaptive optimization: A unified analysis. arXiv preprint arXiv:2503.10537, 2025.
[55] M. D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[56] J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra. Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, volume 33, pages 15383–15393, 2020.
[57] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. In International Conference on Machine Learning, volume 235, pages 61121–61143, 2024.

[1] [2]

K. Ahn, B. Xu, N. Abreu, and J. Langford. Dion: Distributed orthonormalized updates.arXiv preprint arXiv:2504.05295, 2025

work page arXiv 2025

[2] [3]

E. AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, A. Tanwer, D. J. Shah, K. Nguyen, K. Smith, M. Callahan, M. Pust, M. Parmar, P. Rushton, P. Mazarakis, R. Kapila, S. Srivastava, S. Singla, T. Romanski, Y. Vanjani, and A. Vaswani. Practical efficiency of Muon for pretraining.arXiv preprint arXiv:2505.02222, 2025

work page arXiv 2025

[3] [4]

K. An, Y. Liu, R. Pan, S. Ma, D. Goldfarb, and T. Zhang. ASGO: Adaptive structured gradient optimization.arXiv preprint arXiv:2503.20762, Mar. 2025. Preprint

work page arXiv 2025

[5]

R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020

[6]

J. Bernstein and L. Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024

[7]

F. Bian, J.-F. Cai, and R. Zhang. A preconditioned Riemannian gradient descent algorithm for low-rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 45(4):2075–2103, 2024

[8]

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...


[9]

L. Bottou. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics, pages 177–186. Springer, 2010

[10]

D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 111–119, 2015

[11]

D. Carlson, Y.-P. Hsieh, E. Collins, L. Carin, and V. Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2015

[12]

D. E. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. In Advances in Neural Information Processing Systems, volume 28, 2015

[13]

Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1):71–120, 2020

[14]

F. L. Cesista. Muon and a selective survey on steepest descent in Riemannian and non-Riemannian manifolds, 2025

[15]

L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025

[16]

P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006

[17]

P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006

[18]

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011

[19]

S. S. S. Duvvuri, F. Devvrit, R. Anil, C. Hsieh, and I. S. Dhillon. Combining axes preconditioners through Kronecker approximation for deep learning. In International Conference on Learning Representations, 2024

[20]

A. Glentis, J. Li, A. Han, and M. Hong. A minimalist optimizer design for LLM pretraining. arXiv preprint arXiv:2506.16659, 2025

[21]

V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850, 2018

[22]

N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011

[23]

Y. Hao, Y. Cao, and L. Mou. Flora: Low-rank adapters are secretly gradient compressors. In International Conference on Machine Learning, 2024

[24]

C. He, Z. Lu, D. Sun, and Z. Deng. Complexity of normalized stochastic first-order methods with momentum under heavy-tailed noise. arXiv preprint arXiv:2506.11214, 2025

[25]

G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 14(8):2, 2012

[26]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations, 1(2):3, 2022

[27]

K. Jordan, J. Bernstein, B. Rappazzo, @fernbear.bsky.social, B. Vlado, Y. Jiacheng, F. Cesista, B. Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024

[28]

K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon

[29]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015

[30]

D. Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025

[31]

T. T.-K. Lau, Q. Long, and W. Su. PolarGrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective. arXiv preprint arXiv:2505.21799, 2025

[32]

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015

[33]

Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989

[34]

J. Li and M. Hong. A note on the convergence of Muon and further. arXiv e-prints, pages arXiv–2502, 2025

[35]

J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen, H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, X. Zhou, and Z. Yang. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025

[36]

L. Liu, Z. Xu, Z. Zhang, H. Kang, Z. Li, C. Liang, W. Chen, and T. Zhao. COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs. arXiv preprint arXiv:2502.17410, 2025

[37]

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

[38]

C. Ma, W. Gong, M. Scetbon, and E. Meeds. SWAN: Preprocessing SGD enables Adam-level performance on LLM training with significant memory reduction. arXiv e-prints, pages arXiv–2412, 2024

[39]

S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Processing Systems, volume 36, pages 53038–53075, 2023

[40]

P.-G. Martinsson. Randomized methods for matrix computations. arXiv preprint arXiv:1607.01649, 2016

[41]

T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv preprint arXiv:2502.07529, 2025

[42]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019

[43]

A. Riabinin, E. Shulgin, K. Gruntkowska, and P. Richtárik. Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025

[44]

V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2010

[45]

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958

[46]

N. Sato, H. Naganuma, and H. Iiduka. Analysis of Muon’s convergence and critical batch size. arXiv preprint arXiv:2507.01598, 2025

[47]

M.-E. Sfyraki and J.-K. Wang. Lions and Muons: Optimization via stochastic Frank-Wolfe. arXiv preprint arXiv:2506.04192, 2025

[48]

W. Shen, R. Huang, M. Huang, C. Shen, and J. Zhang. On the convergence analysis of Muon. arXiv preprint arXiv:2505.23737, 2025

[49]

C. Si, D. Zhang, and W. Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025

[50]

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[51]

M. Tuddenham, A. Prügel-Bennett, and J. Hare. Orthogonalising gradients to speed up neural network optimisation. arXiv preprint arXiv:2202.07052, 2022

[52]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017
