Recognition: 2 theorem links
· Lean Theorem
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Pith reviewed 2026-05-11 00:47 UTC · model grok-4.3
The pith
Muon optimizer with Nesterov momentum achieves optimal convergence rates under heavy-tailed noise and inexact polar decomposition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a unified framework for inexact polar decomposition, Muon with Nesterov momentum finds an ε-stationary point in O(ε^(-(3α-2)/(α-1))) iterations and samples when the stochastic gradients are heavy-tailed with index α ∈ (1,2]. For the inexact-polar setting with σ1 = 0, the guarantees hold without prior knowledge of α. A randomized low-rank variant is also analyzed and shown to be compatible with the theory.
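As a quick sanity check (not taken from the paper), the sketch below evaluates the claimed exponent at a few tail indices; at α = 2 it recovers the familiar ε^(-4) complexity of non-convex stochastic first-order methods, and the cost grows as α approaches 1.

```python
# Evaluate the claimed iteration-complexity exponent (3*alpha - 2)/(alpha - 1)
# for a few heavy-tail indices alpha in (1, 2].
def complexity_exponent(alpha: float) -> float:
    assert 1.0 < alpha <= 2.0, "heavy-tail index must lie in (1, 2]"
    return (3.0 * alpha - 2.0) / (alpha - 1.0)

for alpha in (2.0, 1.75, 1.5, 1.25):
    print(f"alpha = {alpha:4.2f}  ->  iterations ~ eps^(-{complexity_exponent(alpha):.2f})")
```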
What carries the argument
The unified framework for inexact polar decomposition that quantifies error propagation from iterative approximations such as Newton-Schulz through the optimization dynamics.
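For concreteness, a minimal sketch of the classical cubic Newton-Schulz iteration for the polar factor (practical Muon implementations use tuned higher-order polynomials, and the paper's exact scheme and step count may differ); the residual it prints is the kind of inexactness the framework is meant to absorb.

```python
import numpy as np

def newton_schulz_polar(M: np.ndarray, steps: int = 10) -> np.ndarray:
    """Approximate the polar factor U of M = U P via the cubic Newton-Schulz
    iteration X <- 1.5 X - 0.5 X X^T X. Scaling by the Frobenius norm keeps
    all singular values of X_0 in [0, 1], where nonzero values converge to 1."""
    X = M / (np.linalg.norm(M, ord="fro") + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# The exact polar factor has all singular values equal to 1; the residual
# below measures the approximation error after a fixed number of steps.
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))
U = newton_schulz_polar(M, steps=10)
print(np.abs(np.linalg.svd(U, compute_uv=False) - 1.0).max())
```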
Load-bearing premise
The heavy-tailed noise must satisfy moment conditions indexed by α in (1,2] and the approximation errors from inexact polar decomposition must remain controlled as they propagate through the dynamics.
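A common way to formalize the first condition (the paper's exact statement may use a different norm or constant; G_k, X_k, and σ are assumed notation here) is a bounded α-th central moment of the stochastic gradient, which permits infinite variance whenever α < 2:

```latex
% Heavy-tailed noise assumption, standard form (notation assumed, not quoted):
\mathbb{E}\!\left[\,\bigl\|G_k - \nabla f(X_k)\bigr\\|^{\alpha} \,\middle|\, X_k\right]
\;\le\; \sigma^{\alpha},
\qquad \alpha \in (1,2].
```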
What would settle it
An experiment on a synthetic non-convex matrix problem with known heavy-tail index α where the observed iterations to reach ε-stationarity scale worse than the predicted O(ε^(-(3α-2)/(α-1))) would falsify the complexity bound.
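A hedged sketch of how the synthetic noise in such a test could be generated with a known tail index (the distribution, sample size, and moment orders here are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

def heavy_tailed_noise(shape, alpha, rng):
    """Symmetric Pareto-tailed noise: moments of order < alpha are finite,
    moments of order >= alpha diverge, so the tail index is known by design."""
    signs = rng.choice([-1.0, 1.0], size=shape)
    return signs * (rng.pareto(alpha, size=shape) + 1.0)

rng = np.random.default_rng(1)
noise = heavy_tailed_noise((100_000,), alpha=1.5, rng=rng)
# A low-order empirical moment stays stable; the second moment keeps growing
# with sample size, which is the infinite-variance regime the theory targets.
print(np.mean(np.abs(noise) ** 1.2), np.mean(noise ** 2))
```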
Original abstract
Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of $O \left(\varepsilon^{\frac{-(3\alpha-2)}{(\alpha-1)}} \right)$ for finding an $\varepsilon$-stationary point, where $\alpha\in(1,2]$ denotes the heavy-tail index. For the inexact-polar setting with $\sigma_1=0$, we also provide guarantees that do not require prior knowledge of $\alpha$. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a convergence theory for Muon with Nesterov momentum and inexact (including randomized low-rank) polar decomposition in non-convex matrix optimization under heavy-tailed stochastic gradients. It builds on a unified framework for inexact polar decomposition (capturing Newton-Schulz and similar methods) that quantifies error propagation, and claims an optimal iteration/sample complexity of O(ε^{-(3α-2)/(α-1)}) for an ε-stationary point where α∈(1,2] is the heavy-tail index; it also gives α-independent guarantees for the σ1=0 inexact-polar case.
Significance. If the error-propagation analysis holds, the work supplies the first rigorous rates for a geometry-aware optimizer under heavy tails, which is significant given the prevalence of matrix parameters in neural networks and the practical use of inexact polar factors. The unified framework, the randomized low-rank variant, and the numerical experiments are strengths that enhance applicability.
major comments (2)
- [Section 3] Unified framework for inexact polar decomposition: the framework bounds polar errors additively and feeds them into the heavy-tail descent lemma, but it lacks an explicit lemma establishing that the nonlinear polar factor (via singular vectors) preserves the α-moment condition on the effective noise when the momentum matrix has only α-moments for α < 2. This preservation is load-bearing for the claimed rate in Theorem 4.1, since heavier tails in the composite update could introduce extra factors or invalidate the O(ε^{-(3α-2)/(α-1)}) bound.
- [Proof of Theorem 4.1] The analysis treats the inexact-polar error as an additive perturbation whose norm is controlled, yet it provides no uniform-in-iteration bound showing that the α-moment of the composed noise remains finite and independent of conditioning; without this, the optimality claim and the α-independent guarantee for σ1 = 0 may not hold as stated.
minor comments (2)
- Notation for the heavy-tail index α and the parameters σ1, σ2 could be introduced earlier with a brief reminder of their roles in the moment assumptions.
- Figure captions for the numerical experiments should explicitly state the matrix dimensions and the value of α used in the synthetic heavy-tailed noise.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the major comments point by point below. We will revise the manuscript to add explicit statements on moment preservation for improved clarity.
Point-by-point responses
- Referee [Section 3]: the framework bounds polar errors additively and feeds them into the heavy-tail descent lemma, but it lacks an explicit lemma establishing that the nonlinear polar factor (via singular vectors) preserves the α-moment condition on the effective noise when the momentum matrix has only α-moments for α < 2. This preservation is load-bearing for the claimed rate in Theorem 4.1, since heavier tails in the composite update could introduce extra factors or invalidate the O(ε^{-(3α-2)/(α-1)}) bound.
Authors: We appreciate the referee highlighting the need for an explicit lemma on this point. The polar factor has operator norm exactly 1, so the effective update direction is bounded in norm by 1; this immediately yields that its α-moment is at most 1 (uniformly in the iteration and independent of conditioning). The additive inexact-polar error is controlled separately by the framework. To address the concern directly, we will insert an explicit lemma in the revised Section 3 establishing preservation of the α-moment for the composed noise term under the stated conditions. revision: yes
- Referee [Proof of Theorem 4.1]: the analysis treats the inexact-polar error as an additive perturbation whose norm is controlled, yet it provides no uniform-in-iteration bound showing that the α-moment of the composed noise remains finite and independent of conditioning; without this, the optimality claim and the α-independent guarantee for σ1 = 0 may not hold as stated.
Authors: We agree that an explicit uniform-in-iteration bound strengthens the proof. Because the polar factor has operator norm at most 1, the α-moment of the composed noise (polar factor plus additive inexact error) is bounded by a constant such as (1 + σ1)^α that is independent of the iteration and of conditioning. We will add this uniform bound explicitly to the proof of Theorem 4.1 (with a cross-reference to the new lemma in Section 3), which also secures the α-independent guarantee when σ1 = 0. revision: yes
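A minimal reconstruction of the bound the response appeals to, assuming the update direction decomposes as the polar factor U_k plus an inexactness term E_k with operator norm at most σ1 (the paper's precise decomposition may differ):

```latex
\|U_k\|_{\mathrm{op}} = 1
\quad\text{and}\quad
\|E_k\|_{\mathrm{op}} \le \sigma_1
\;\Longrightarrow\;
\mathbb{E}\!\left[\|U_k + E_k\|_{\mathrm{op}}^{\alpha} \,\middle|\, \mathcal{F}_k\right]
\le (1+\sigma_1)^{\alpha}
\quad\text{for every } k \text{ and every } \alpha \in (1,2].
```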
Circularity Check
Analysis builds on external unified framework with no self-referential reduction in rates
Full rationale
The paper derives optimal iteration/sample complexity O(ε^(-(3α-2)/(α-1))) for ε-stationary points under heavy-tailed noise (α∈(1,2]) by building on a unified framework for inexact polar decomposition (Newton-Schulz etc.) that quantifies error propagation. No quoted equations or steps show the final bound reducing by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation chain; the framework is invoked as external support, and the σ1=0 case even removes the need for prior α knowledge. This leaves the central claim with independent analytical content, consistent with only minor self-citation at most.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Stochastic gradients are heavy-tailed with index α ∈ (1,2]
- domain assumption: Inexact polar decomposition errors are bounded and propagate controllably through the dynamics
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We establish an optimal iteration and sample complexity of O(ε^−(3α−2)/(α−1)) for finding an ε-stationary point, where α∈(1,2] denotes the heavy-tail index. ... unified framework for inexact polar decomposition ... Newton-Schulz"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Assumption 2.1 ... E[⟨M_k, T(M_k)⟩ | M_k] ≥ (1 − γ_k)∥M_k∥_* and E[∥T(M_k)∥_op² | M_k] ≤ (1 + ν_k)²"
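For intuition, a hedged numerical sketch of the two quoted conditions using a rank-r truncated SVD as a stand-in oracle T; the dimensions, rank, and realized γ_k, ν_k below are illustrative assumptions, not the paper's randomized scheme.

```python
import numpy as np

# Hypothetical check of the two quoted conditions for a rank-r truncated-SVD
# "polar factor", used as a crude deterministic stand-in for the paper's
# randomized low-rank scheme (whose sampling mechanism is not reproduced).
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))
r = 8
U, s, Vt = np.linalg.svd(M, full_matrices=False)
T = U[:, :r] @ Vt[:r, :]                       # low-rank approximation T(M)

alignment = np.trace(M.T @ T) / s.sum()        # realized <M, T(M)> / ||M||_*,
                                               # must be >= 1 - gamma_k
op_norm_sq = np.linalg.norm(T, 2) ** 2         # realized ||T(M)||_op^2,
                                               # must be <= (1 + nu_k)^2
print(f"alignment = {alignment:.3f}, ||T(M)||_op^2 = {op_norm_sq:.3f}")
```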
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.