
arxiv: 2509.11983 · v2 · submitted 2025-09-15 · 💻 cs.LG · math.OC

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

Pith reviewed 2026-05-18 15:43 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords low-rank orthogonalization · Muon optimizer · foundation model training · matrix optimization · neural network training · stochastic optimization · gradient descent · iteration complexity

The pith

Low-rank orthogonalization exploits the low-rank structure of gradients to create a Muon variant that outperforms the original on large foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes low-rank orthogonalization, which performs matrix orthogonalization by leveraging the low-rank nature of gradients during neural network training. This forms the basis for low-rank matrix-signed gradient descent and a low-rank variant of Muon. Experiments on GPT-2 and LLaMA pretraining demonstrate that low-rank Muon surpasses carefully tuned vanilla Muon, particularly on tasks with large model sizes. The work also derives iteration complexity bounds for low-rank MSGD to reach approximate stationary solutions and for low-rank Muon under heavy-tailed stochastic noise.

Core claim

By replacing full orthogonalization with a low-rank version that uses the low-rank structure of gradients, low-rank MSGD achieves provable iteration complexity for approximate stationary points while low-rank Muon does the same for approximate stochastic stationary points under heavy-tailed noise. Numerical results show this yields superior performance over vanilla Muon in GPT-2 and LLaMA pretraining at large scales.

What carries the argument

Low-rank orthogonalization: orthogonalizing a gradient matrix after projecting it onto a low-rank subspace, which exploits the low-rank structure that gradients are observed to have during training.
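
One way to picture the operator is the sketch below. It is an illustration under stated assumptions, not the authors' algorithm: the Figure 2 caption mentions Gaussian sketching and a QR step, so the code sketches the gradient with a Gaussian matrix, takes a QR basis of the result, and orthogonalizes the small projected matrix with an SVD. The function name `low_rank_orthogonalize`, the `oversample` parameter, and the final truncation to rank `k` are hypothetical choices in the style of randomized range finders.

```python
import numpy as np

def low_rank_orthogonalize(G, k, oversample=4, rng=None):
    """Approximate U_k V_k^T, the orthogonal factor of the dominant rank-k part of G.

    Hypothetical sketch: Gaussian sketching + QR + SVD of a small matrix.
    Assumes 1 <= k <= min(G.shape).
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = G.shape
    width = min(k + oversample, m, n)          # sketch width (assumed choice)
    omega = rng.standard_normal((n, width))    # Gaussian sketching matrix
    Q, _ = np.linalg.qr(G @ omega)             # orthonormal basis of the sketched range
    B = Q.T @ G                                # small (width x n) projected matrix
    Ub, _, Vbt = np.linalg.svd(B, full_matrices=False)
    # Keep the k dominant directions and set their singular values to 1.
    return (Q @ Ub[:, :k]) @ Vbt[:k, :]

# Check on an exactly rank-4 matrix, where the result should match U V^T.
rng = np.random.default_rng(0)
m, n, r = 64, 48, 4
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
G = U @ np.diag([4.0, 3.0, 2.0, 1.0]) @ V.T
O = low_rank_orthogonalize(G, k=r, rng=rng)
print(np.allclose(O, U @ V.T, atol=1e-8))
```

On an exactly rank-k input this recovers the full orthogonal factor. On a full-rank gradient it only approximates the factor of the dominant k directions, and the quality depends on how quickly the singular values decay.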

If this is right

  • Low-rank Muon surpasses vanilla Muon on GPT-2 and LLaMA pretraining tasks with large model sizes.
  • Low-rank MSGD reaches an approximate stationary solution with established iteration complexity.
  • Low-rank Muon reaches an approximate stochastic stationary solution under heavy-tailed noise with established iteration complexity.
  • The low-rank approach applies directly to other large-scale matrix optimization problems arising in neural network training.
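
To make the matrix-signed update concrete, here is a toy sketch of a low-rank signed gradient iteration on a quadratic. Everything in it is an assumption for illustration: the objective `0.5 * ||X - A||_F^2`, the fixed step size, the names `truncated_msign` and `low_rank_msgd`, and the use of an exact truncated SVD as a stand-in for the paper's cheaper low-rank orthogonalization. Muon additionally applies the orthogonalization to a momentum buffer rather than the raw gradient, which is omitted here.

```python
import numpy as np

def truncated_msign(G, k):
    """U_k V_k^T from an exact SVD (stand-in for a cheaper low-rank routine)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] @ Vt[:k, :]

def low_rank_msgd(grad, X0, k, step=0.05, iters=300):
    """Iterate X <- X - step * truncated_msign(grad(X), k).

    Returns the final iterate and the Frobenius norm of the gradient per iterate.
    """
    X = X0.copy()
    norms = [np.linalg.norm(grad(X))]
    for _ in range(iters):
        X = X - step * truncated_msign(grad(X), k)
        norms.append(np.linalg.norm(grad(X)))
    return X, norms

# Toy problem: f(X) = 0.5 * ||X - A||_F^2, so grad f(X) = X - A.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((16, 16)))
V, _ = np.linalg.qr(rng.standard_normal((16, 16)))
A = U @ np.diag(5.0 * 0.7 ** np.arange(16)) @ V.T   # decaying spectrum
X, norms = low_rank_msgd(lambda X: X - A, np.zeros_like(A), k=4)
print(f"gradient norm: {norms[0]:.3f} -> {norms[-1]:.3f}")
```

With a fixed step the iterates settle into a neighborhood of the minimizer whose size scales with the step, since each update moves the selected singular values by exactly `step`. A decaying step size would shrink that neighborhood.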

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the low-rank gradient property persists at even larger scales, the method could cut memory and compute costs for orthogonalization steps in models beyond current LLaMA sizes.
  • Similar low-rank projections might improve other matrix-aware optimizers that rely on orthogonalization or signed updates.
  • The assumption could be validated by tracking the effective rank of gradients across different model architectures and datasets to identify when the speedup is reliable.

Load-bearing premise

Gradients during neural network training have enough low-rank structure that replacing full orthogonalization with a low-rank version preserves convergence and model quality.
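
A short worked identity shows why the premise matters. It is an idealized sketch, assuming the low-rank step returns the exact truncated factor; the symbols msgn and O_k are notation introduced here, not taken from the paper, whose randomized operator and bounds may differ.

```latex
% Idealized setting (assumption): exact truncated SVD, rank(G) = r, truncation rank k <= r.
G = \sum_{i=1}^{r} \sigma_i u_i v_i^\top, \quad \sigma_1 \ge \dots \ge \sigma_r > 0,
\qquad
\operatorname{msgn}(G) = \sum_{i=1}^{r} u_i v_i^\top,
\qquad
O_k = \sum_{i=1}^{k} u_i v_i^\top .

% Alignment of each direction with the gradient (trace inner product):
\langle G, \operatorname{msgn}(G) \rangle = \sum_{i=1}^{r} \sigma_i = \|G\|_* ,
\qquad
\langle G, O_k \rangle = \sum_{i=1}^{k} \sigma_i ,
\qquad
\langle G, \operatorname{msgn}(G) - O_k \rangle = \sum_{i=k+1}^{r} \sigma_i .
```

In this idealization the first-order decrease lost by truncating equals the sum of the tail singular values, so the low-rank direction is nearly as well aligned as the full one exactly when that tail is small.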

What would settle it

A controlled experiment on a large LLaMA pretraining run where low-rank Muon converges to a strictly worse loss or requires substantially more steps than vanilla Muon would falsify the practical claim.

Figures

Figures reproduced from arXiv: 2509.11983 by Chuan He, Zhanwang Deng, Zhaosong Lu.

Figure 1. Left: Comparison of GPU computation time across Newton-Schulz iterations (NS), our low-rank ...
Figure 2. Time distribution for low-rank orthogonalization with Gaussian sketching, including the QR ...
Figure 3. Comparison of variance levels in matrix sign estimation across Newton-Schulz iterations (NS), ...
Figure 4. Objective values (top row) and gradient Frobenius norms (bottom row) for all methods during ...
Figure 5. Singular value distributions of the Q, K, and V matrices: across layers at iteration 300 (top ...
Figure 6. Comparison of validation perplexity versus computational time for all competing methods in ...
Figure 7. Singular value distributions of the Q, K, and V matrices: across layers at iteration 300 (top ...
Figure 8. Comparison of validation perplexity and computational time for all competing methods in ...
Original abstract

Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon (Jordan et al. [28]), which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose low-rank orthogonalization, which performs orthogonalization by leveraging the low-rank nature of gradients during NN training. Building on this, we introduce low-rank matrix-signed gradient descent (MSGD) and a low-rank variant of Muon. Numerical experiments demonstrate the superior performance of low-rank orthogonalization, with low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining, surpassing the carefully tuned vanilla Muon on tasks with large model sizes. Theoretically, we establish the iteration complexity of low-rank MSGD for finding an approximate stationary solution, and the iteration complexity of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise. The code to reproduce our numerical experiments is available at https://github.com/dengzhanwang/Low-rank-Muon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes low-rank orthogonalization to exploit the low-rank structure of gradients in neural network training. It introduces low-rank matrix-signed gradient descent (MSGD) and a low-rank variant of the Muon optimizer, reports empirical results showing low-rank Muon outperforming carefully tuned vanilla Muon in GPT-2 and LLaMA pretraining on large models, and derives iteration complexity bounds for low-rank MSGD (approximate stationary points) and low-rank Muon (approximate stochastic stationary points under heavy-tailed noise). Code is provided for reproducibility.

Significance. If the low-rank gradient structure is verified and the performance gains hold under rigorous controls, the work could improve computational efficiency of orthogonalization steps in large-scale foundation model training without sacrificing convergence quality. The availability of reproduction code and the extension of Muon to low-rank settings are positive contributions to the matrix-optimization literature for deep learning.

major comments (2)
  1. [Numerical Experiments] The central empirical claim (low-rank Muon surpassing tuned vanilla Muon on large-model tasks) and the theoretical iteration bounds both rest on the premise that gradients possess exploitable low-rank structure. However, the manuscript reports neither the effective numerical rank nor the fraction of Frobenius norm captured by the top-k singular vectors at any training step in the GPT-2 or LLaMA experiments. Without these diagnostics, observed gains could arise from incidental implementation differences or hyper-parameter effects rather than the low-rank mechanism.
  2. [Theoretical Analysis] §4 (theoretical analysis): the iteration complexity statements for low-rank MSGD and low-rank Muon appear to follow from standard stochastic optimization arguments applied to a truncated orthogonalization operator. The derivation does not explicitly quantify how the rank truncation threshold affects the constants in the complexity bounds or the bias introduced relative to full orthogonalization.
minor comments (2)
  1. [Introduction] The abstract and introduction use the phrase 'low-rank nature of gradients' without a precise definition or reference to prior work quantifying this property in transformer training.
  2. [Numerical Experiments] Figure captions and experimental tables should explicitly state the rank truncation threshold used for each model size and whether it was tuned or fixed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the empirical validation of the low-rank structure and the transparency of the theoretical analysis. We address each major comment below and commit to revisions that directly incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Numerical Experiments] The central empirical claim (low-rank Muon surpassing tuned vanilla Muon on large-model tasks) and the theoretical iteration bounds both rest on the premise that gradients possess exploitable low-rank structure. However, the manuscript reports neither the effective numerical rank nor the fraction of Frobenius norm captured by the top-k singular vectors at any training step in the GPT-2 or LLaMA experiments. Without these diagnostics, observed gains could arise from incidental implementation differences or hyper-parameter effects rather than the low-rank mechanism.

    Authors: We agree that providing these diagnostics is important to substantiate the low-rank premise. In the revised version, we will add new figures and tables in the experimental section that report, for multiple training steps in both the GPT-2 and LLaMA runs, the effective numerical rank (defined via singular values exceeding a small threshold relative to the largest) and the cumulative fraction of Frobenius norm captured by the top-k singular vectors. These additions will be computed from the existing experimental code and will directly address whether the observed gains align with the low-rank gradient structure. revision: yes

  2. Referee: [Theoretical Analysis] §4 (theoretical analysis): the iteration complexity statements for low-rank MSGD and low-rank Muon appear to follow from standard stochastic optimization arguments applied to a truncated orthogonalization operator. The derivation does not explicitly quantify how the rank truncation threshold affects the constants in the complexity bounds or the bias introduced relative to full orthogonalization.

    Authors: We appreciate this point on the need for greater explicitness. Although the current proofs apply standard stochastic optimization techniques to the low-rank truncated operator and control the truncation error via the assumed low-rank gradient structure, we will revise §4 to include an additional lemma or remark that explicitly bounds the dependence of the complexity constants on the rank threshold k. This will also quantify the bias relative to full orthogonalization in terms of the tail singular values, under the paper's low-rank gradient assumption, thereby clarifying the relationship between the low-rank and full-rank cases. revision: yes
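
The two diagnostics named in the first response can be sketched in a few lines. This is a minimal illustration, not the authors' experimental code: the function name `low_rank_diagnostics` and the default threshold `rel_tol=1e-3` are assumptions, since the text only specifies "a small threshold relative to the largest" singular value.

```python
import numpy as np

def low_rank_diagnostics(G, k, rel_tol=1e-3):
    """Return (effective_rank, top_k_fraction) for a gradient matrix G.

    effective_rank: number of singular values above rel_tol * largest.
    top_k_fraction: Frobenius norm of the best rank-k approximation of G
                    divided by the Frobenius norm of G.
    rel_tol is a hypothetical default, not a value from the paper.
    """
    s = np.linalg.svd(G, compute_uv=False)   # singular values, descending
    if s[0] == 0.0:
        return 0, 0.0
    effective_rank = int(np.sum(s > rel_tol * s[0]))
    top_k_fraction = float(np.linalg.norm(s[:k]) / np.linalg.norm(s))
    return effective_rank, top_k_fraction

# Small check with known singular values 4, 3, 1e-6, 0, 0.
G = np.zeros((6, 5))
G[0, 0], G[1, 1], G[2, 2] = 4.0, 3.0, 1e-6
rank, frac = low_rank_diagnostics(G, k=1)
print(rank, round(frac, 4))
```

Logging these two numbers per layer at a handful of training steps would show directly whether a given truncation rank captures most of the gradient. Reporting the squared ratio instead gives the fraction of energy rather than of norm.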

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained

full rationale

The paper motivates low-rank orthogonalization from the empirical observation that gradients in NN training exhibit low-rank structure, then defines low-rank MSGD and low-rank Muon accordingly. Iteration complexity results for approximate stationary points are obtained by applying standard stochastic optimization arguments to these modified updates under heavy-tailed noise, without reducing to fitted parameters, self-definitional loops, or load-bearing self-citations. The central empirical claims rest on numerical experiments with GPT-2 and LLaMA rather than tautological renaming or imported uniqueness theorems. No step equates a prediction to its input by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claims rest on the empirical observation that gradients are low-rank during training and on standard assumptions from stochastic optimization theory.

free parameters (1)
  • rank truncation threshold
    The specific rank used for the low-rank approximation is chosen based on observed gradient structure and is not derived from first principles.
axioms (2)
  • domain assumption: Gradients encountered during neural network training admit a useful low-rank approximation.
    This premise is invoked to motivate replacing full orthogonalization with the low-rank variant.
  • standard math: Standard stochastic gradient assumptions hold under heavy-tailed noise.
    Used to establish the iteration complexity of low-rank Muon.
invented entities (1)
  • low-rank orthogonalization operator (no independent evidence)
    purpose: Efficiently approximate matrix orthogonalization by operating only on dominant singular directions.
    New algorithmic primitive introduced in the paper.



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

    cs.LG 2026-05 conditional novelty 7.0

    Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

  2. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  3. Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

    math.OC 2026-05 unverdicted novelty 6.0

    Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.

  4. MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

    cs.LG 2026-05 unverdicted novelty 5.0

    MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

  5. Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

    cs.LG 2026-05 unverdicted novelty 5.0

    Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 5 Pith papers · 9 internal anchors

  1. [2]

    K. Ahn, B. Xu, N. Abreu, and J. Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025

  2. [3]

    E. AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, A. Tanwer, D. J. Shah, K. Nguyen, K. Smith, M. Callahan, M. Pust, M. Parmar, P. Rushton, P. Mazarakis, R. Kapila, S. Srivastava, S. Singla, T. Romanski, Y. Vanjani, and A. Vaswani. Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222, 2025

  3. [4]

    K. An, Y. Liu, R. Pan, S. Ma, D. Goldfarb, and T. Zhang. ASGO: Adaptive structured gradient optimization. arXiv preprint arXiv:2503.20762, 2025

  4. [5]

    R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020

  5. [6]

    Old Optimizer, New Norm: An Anthology

    J. Bernstein and L. Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024

  6. [7]

    F. Bian, J.-F. Cai, and R. Zhang. A preconditioned Riemannian gradient descent algorithm for low-rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 45(4):2075–2103, 2024

  7. [8]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

  8. [9]

    L. Bottou. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics, pages 177–186. Springer, 2010

  9. [10]

    D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 111–119, 2015

  10. [11]

    D. Carlson, Y.-P. Hsieh, E. Collins, L. Carin, and V. Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2015

  11. [12]

    D. E. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. In Advances in Neural Information Processing Systems, volume 28, 2015

  12. [13]

    Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1):71–120, 2020

  13. [14]

    F. L. Cesista. Muon and a selective survey on steepest descent in Riemannian and non-Riemannian manifolds, 2025

  14. [15]

    L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025

  15. [16]

    P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006

  16. [17]

    P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006

  17. [18]

    J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011

  18. [19]

    S. S. S. Duvvuri, F. Devvrit, R. Anil, C. Hsieh, and I. S. Dhillon. Combining axes preconditioners through Kronecker approximation for deep learning. In International Conference on Learning Representations, 2024

  19. [20]

    A. Glentis, J. Li, A. Han, and M. Hong. A minimalist optimizer design for LLM pretraining. arXiv preprint arXiv:2506.16659, 2025

  20. [21]

    V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850, 2018

  21. [22]

    N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011

  22. [23]

    Y. Hao, Y. Cao, and L. Mou. Flora: Low-rank adapters are secretly gradient compressors. In International Conference on Machine Learning, 2024

  23. [24]

    C. He, Z. Lu, D. Sun, and Z. Deng. Complexity of normalized stochastic first-order methods with momentum under heavy-tailed noise. arXiv preprint arXiv:2506.11214, 2025

  24. [25]

    G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 14(8):2, 2012

  25. [26]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations, 1(2):3, 2022

  26. [27]

    K. Jordan, J. Bernstein, B. Rappazzo, @fernbear.bsky.social, B. Vlado, Y. Jiacheng, F. Cesista, B. Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024

  27. [28]

    K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon

  28. [29]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015

  29. [30]

    D. Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025

  30. [31]

    T. T.-K. Lau, Q. Long, and W. Su. PolarGrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective. arXiv preprint arXiv:2505.21799, 2025

  31. [32]

    Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015

  32. [33]

    Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989

  33. [34]

    J. Li and M. Hong. A note on the convergence of Muon and further. arXiv e-prints, pages arXiv–2502, 2025

  34. [35]

    J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen, H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, X. Zhou, and Z. Yang. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025

  35. [36]

    L. Liu, Z. Xu, Z. Zhang, H. Kang, Z. Li, C. Liang, W. Chen, and T. Zhao. COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs. arXiv preprint arXiv:2502.17410, 2025

  36. [37]

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  37. [38]

    C. Ma, W. Gong, M. Scetbon, and E. Meeds. SWAN: Preprocessing SGD enables Adam-level performance on LLM training with significant memory reduction. arXiv e-prints, pages arXiv–2412, 2024

  38. [39]

    S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Processing Systems, volume 36, pages 53038–53075, 2023

  39. [40]

    Randomized methods for matrix computations

    P.-G. Martinsson. Randomized methods for matrix computations. arXiv preprint arXiv:1607.01649, 2016

  40. [41]

    Training Deep Learning Models with Norm-Constrained LMOs

    T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv preprint arXiv:2502.07529, 2025

  41. [42]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019

  42. [43]

    A. Riabinin, E. Shulgin, K. Gruntkowska, and P. Richtárik. Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025

  43. [44]

    V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2010

  44. [45]

    F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958

  45. [46]

    N. Sato, H. Naganuma, and H. Iiduka. Analysis of Muon’s convergence and critical batch size. arXiv preprint arXiv:2507.01598, 2025

  46. [47]

    M.-E. Sfyraki and J.-K. Wang. Lions and Muons: Optimization via stochastic Frank-Wolfe. arXiv preprint arXiv:2506.04192, 2025

  47. [48]

    W. Shen, R. Huang, M. Huang, C. Shen, and J. Zhang. On the convergence analysis of Muon. arXiv preprint arXiv:2505.23737, 2025

  48. [49]

    C. Si, D. Zhang, and W. Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025

  49. [50]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  50. [51]

    M. Tuddenham, A. Prügel-Bennett, and J. Hare. Orthogonalising gradients to speed up neural network optimisation. arXiv preprint arXiv:2202.07052, 2022

  51. [52]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

  52. [53]

    N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam for language modeling. In International Conference on Learning Representations, 2025

  53. [54]

    S. Xie, T. Wang, S. J. Reddi, S. Kumar, and Z. Li. Structured preconditioners in adaptive optimization: A unified analysis. arXiv preprint arXiv:2503.10537, 2025

  54. [55]

    M. D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012

  55. [56]

    J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra. Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, volume 33, pages 15383–15393, 2020

  56. [57]

    J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. In International Conference on Machine Learning, volume 235, pages 61121–61143, 2024