Why Muon Outperforms Adam: A Curvature Perspective

Dirk Bergemann; Fengzhuo Zhang; Jiaxiang Li; Shuche Wang; Zhuoran Yang

arxiv: 2606.04662 · v1 · pith:DCXPIFM3new · submitted 2026-06-03 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-28 07:02 UTC · model grok-4.3


keywords: Muon optimizer · Adam optimizer · normalized directional sharpness · curvature analysis · LLM training · quadratic problems · data imbalance · second-order Taylor expansion


The pith

Muon outperforms Adam by incurring a smaller curvature penalty, which comes from lower normalized directional sharpness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Muon's training advantage over Adam stems from a smaller second-order curvature penalty in the loss landscape. Although both optimizers deliver similar first-order improvements, Muon reduces the normalized directional sharpness component of that penalty. This reduction is strengthened by imbalanced training data and occurs mainly through lower within-layer curvature in later training phases. Analysis of quadratic problems with heterogeneous curvature proves that Muon lowers average NDS compared to gradient descent by spreading update energy evenly, which produces a smaller local loss when curvature differences are large enough.

Core claim

Applying a second-order Taylor expansion, the paper demonstrates that Muon produces a larger one-step loss decrease than Adam at the same validation loss because its curvature penalty is smaller. This advantage traces to lower Normalized Directional Sharpness rather than differences in update norm. In controlled experiments with Zipf-PCFG data, data imbalance widens the gap, and layer decomposition shows the effect concentrates in within-layer curvature during middle and late training. For stylized quadratics with heterogeneous curvature and gradient alignment to high-curvature modes, Muon is proven to achieve smaller average NDS than gradient descent by balancing update energy, resulting in lower local quadratic loss when curvature heterogeneity is sufficiently strong.

What carries the argument

Normalized Directional Sharpness (NDS), the curvature term in the second-order loss change normalized by the squared update norm, which Muon reduces by balancing update energy across curvature groups.
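
The decomposition described above can be written out as follows. This is a sketch under notational assumptions: the symbols $L$ (loss), $W$ (weights), $G$ (gradient), $H$ (Hessian), and $\Delta$ (update), as well as the placement of the factor $\tfrac12$, are chosen here for illustration and may differ from the paper's conventions.

```latex
% Second-order Taylor model of the one-step loss change
L(W + \Delta) - L(W) \;\approx\;
  \underbrace{\langle G, \Delta \rangle}_{\text{first-order gain}}
  \;+\;
  \underbrace{\tfrac{1}{2}\,\langle \Delta, H[\Delta] \rangle}_{\text{curvature penalty}}

% Curvature penalty split into squared update norm and NDS
\tfrac{1}{2}\,\langle \Delta, H[\Delta] \rangle
  \;=\; \tfrac{1}{2}\,\|\Delta\|_F^2 \cdot \mathrm{NDS}(\Delta),
\qquad
\mathrm{NDS}(\Delta) \;=\; \frac{\langle \Delta, H[\Delta] \rangle}{\|\Delta\|_F^2}
```

Under this form, two updates with equal norm can incur different penalties only through their NDS, which is the comparison the review attributes to the paper.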

If this is right

Muon achieves a larger one-step loss decrease than Adam at matched validation loss.
Data imbalance amplifies Muon's NDS advantage over Adam.
Muon's NDS reduction in middle and late stages is driven by smaller within-layer curvature.
In quadratic settings with strong curvature heterogeneity, balancing update energy yields lower local loss after equal steps.
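
The energy-balancing claim in the last item can be illustrated on a toy quadratic. The sketch below is not the paper's construction: the orthonormal bases `U` and `V`, the curvature values `lam`, and the choice `sigma = sqrt(lam)` are hypothetical, and the exact SVD stands in for whatever orthogonalization Muon uses in practice. It assumes curvature modes $M_i = u_i v_i^\top$ and a gradient $G = \sum_i \sigma_i M_i$ whose coefficients are largest on high-curvature modes.

```python
import numpy as np

def orthogonalize(G):
    """Idealized Muon direction: the polar factor U V^T of G, via thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def nds(delta, hess_apply):
    """Normalized directional sharpness <delta, H[delta]> / ||delta||_F^2."""
    return float(np.sum(delta * hess_apply(delta)) / np.sum(delta ** 2))

rng = np.random.default_rng(0)
d = 8
# Hypothetical orthonormal bases; curvature modes are M_i = u_i v_i^T.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.array([100.0, 50.0, 10.0, 5.0, 1.0, 0.5, 0.1, 0.05])  # heterogeneous curvature
sigma = np.sqrt(lam)  # gradient coefficients, largest on high-curvature modes

def hess_apply(X):
    """H[X] = sum_i lam_i <X, M_i> M_i for the toy quadratic."""
    coeffs = np.diag(U.T @ X @ V)          # <X, M_i> = u_i^T X v_i
    return U @ np.diag(lam * coeffs) @ V.T

G = U @ np.diag(sigma) @ V.T               # gradient aligned with the modes
nds_gd = nds(-G, hess_apply)               # plain gradient direction
nds_muon = nds(-orthogonalize(G), hess_apply)  # equal energy on every mode
print(f"NDS (GD):   {nds_gd:.3f}")
print(f"NDS (Muon): {nds_muon:.3f}")
```

In this toy setting the orthogonalized update puts unit weight on every mode, so its NDS equals the plain average of the curvatures, while the gradient direction weights each curvature by $\sigma_i^2$ and is pulled toward the sharpest modes.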

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The balancing mechanism may extend to other first-order methods that adjust updates according to curvature variation.
The result implies that ignoring curvature heterogeneity could limit performance in data regimes with strong imbalance.
Architecture modifications that reduce within-layer curvature differences could interact with or amplify such optimizer effects.

Load-bearing premise

The second-order Taylor approximation accurately captures the one-step loss decrease in the high-dimensional non-convex landscape of large language model training.

What would settle it

If measurements during actual LLM training show that Muon does not produce a larger one-step loss reduction than Adam when validation losses are matched, or that its NDS is not lower, the curvature explanation would not hold.

Figures

Figures reproduced from arXiv: 2606.04662 by Dirk Bergemann, Fengzhuo Zhang, Jiaxiang Li, Shuche Wang, Zhuoran Yang.

**Figure 2.** NDS and update-norm comparisons between Muon and Adam. Panel (a) plots NDS for Muon and Adam. Panel (b) plots the update norms of Muon and Adam. Panel (c) reports the Adam-to-Muon ratios of the curvature penalty, NDS, and the squared Frobenius norm of the update. Muon and Adam have similar update norms, whereas Muon has smaller NDS than Adam. Moreover, the Adam-to-Muon ratio of NDS closely tracks that of t…

**Figure 3.** Effect of different levels of imbalance (…

**Figure 4.** Within-layer and cross-layer decomposition of directional sharpness over training. Panel (a) …

**Figure 5.** Empirical support for Assumption 5.1. Panel (a) reports the fraction of Frobenius energy explained by low-rank Kronecker approximations to the Hessians of the four attention matrices. Panel (b) visualizes the $W_V$ Hessian, its rank-4 Kronecker approximation, and the residual error. The results show that the Hessians of attention matrices can be well approximated by low-rank Kronecker products.

**Figure 6.** Empirical support for Assumptions 5.2–5.4. Panel (a) shows the average values of the simultaneous-diagonalization score $\eta_{\mathrm{sd}}$ for $\{A_k\}_{k=1}^{r}$ and $\{B_k\}_{k=1}^{r}$. Panel (b) shows the value of positive curvatures. Panel (c) shows the cumulative gradient energy ratio $\zeta(i)$. These results support the approximate simultaneous diagonalization of the matrices in the Hessian decomposition, as well as the alignment of grad…

**Figure 7.** Panel (a) reports the NDS ratio, and Panel (b) reports the loss-decrease ratio. The results show that GD and Adam exhibit similar behavior in both NDS and loss decrease on quadratic problems satisfying Assumptions 5.1–5.4. Assumption 5.4 (Gradient Alignment). The gradient $G$ is in the subspace spanned by $\{M_i\}_{i=1}^{q}$, i.e., $G = \sum_{i=1}^{q} \sigma_i M_i$, where $\sigma_i = \langle G, M_i \rangle$. In addition, the coefficients $\sigma_i$ have the same tw…

**Figure 8.** NDS and the corresponding ratio comparison along the training steps of Muon and Adam. Panel …

**Figure 9.** Layerwise localization of the Adam–Muon within-layer sharpness gap across the 12 Transformer …

**Figure 10.** $W_Q$ Hessian, rank-4 Kronecker approximation, and residual.

**Figure 11.** $W_K$ Hessian, rank-4 Kronecker approximation, and residual. Panels: mat(H), $H_r$ ($r = 4$), and mat(H) − $H_r$, each plotted against row index $i$.

**Figure 12.** $W_O$ Hessian, rank-4 Kronecker approximation, and residual. D.2 Simultaneous Diagonalization (Assumption 5.2). We verify this assumption on the attention block, using Kronecker rank $r = 4$. This choice balances approximation quality and computational tractability: at $r = 4$, the Kronecker approximation already captures $\xi(4) \ge 0.87$ of the Frobenius energy for three of the four attention matrices (Appendix …
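
The energy fraction $\xi(r)$ quoted in the captions above can be computed for a small matrix as sketched below. This is an illustration under assumptions: it uses the standard Van Loan and Pitsianis rearrangement, under which a sum of Kronecker products becomes a low-rank matrix, and the function name `kron_energy_fraction`, the block ordering, and the toy dimensions are chosen here and are not taken from the paper.

```python
import numpy as np

def kron_energy_fraction(H, d1, d2, r):
    """Fraction of ||H||_F^2 captured by the best rank-r Kronecker approximation.

    H has shape (d1*d2, d1*d2) and is approximated by sum_k kron(A_k, B_k)
    with A_k of shape (d1, d1) and B_k of shape (d2, d2).
    """
    # Rearrangement: each Kronecker term kron(A, B) becomes vec(A) vec(B)^T.
    R = H.reshape(d1, d2, d1, d2).transpose(0, 2, 1, 3).reshape(d1 * d1, d2 * d2)
    s = np.linalg.svd(R, compute_uv=False)
    return float(np.sum(s[:r] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
d1, d2 = 3, 4
# Toy matrix that is exactly a sum of two Kronecker products.
H = sum(np.kron(rng.standard_normal((d1, d1)), rng.standard_normal((d2, d2)))
        for _ in range(2))
xi_1 = kron_energy_fraction(H, d1, d2, 1)
xi_2 = kron_energy_fraction(H, d1, d2, 2)
print(f"xi(1) = {xi_1:.4f}, xi(2) = {xi_2:.4f}")
```

Because the toy matrix has Kronecker rank two by construction, two terms recover essentially all of its Frobenius energy, whereas a real attention Hessian would be expected to leave a residual at any small rank.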

Original abstract

Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
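
The within-/cross-layer decomposition mentioned in the abstract can be sketched for a flattened parameter vector. The version below is an assumption about how such a split could be defined: the quadratic form $\langle \Delta, H\Delta \rangle$ is divided into contributions from diagonal (same-layer) Hessian blocks and off-diagonal (different-layer) blocks. The layer sizes, the random symmetric matrix, and the random update are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]                    # hypothetical per-layer parameter counts
n = sum(layer_sizes)
A = rng.standard_normal((n, n))
H = (A + A.T) / 2                          # toy symmetric stand-in for the Hessian
delta = rng.standard_normal(n)             # toy flattened update

# Boolean mask that is True on the diagonal (same-layer) blocks.
mask = np.zeros((n, n), dtype=bool)
start = 0
for size in layer_sizes:
    mask[start:start + size, start:start + size] = True
    start += size

total = float(delta @ H @ delta)           # <delta, H delta>
within = float(delta @ (H * mask) @ delta)     # same-layer blocks only
cross = float(delta @ (H * ~mask) @ delta)     # different-layer blocks only

norm_sq = float(delta @ delta)
print("within-layer share of NDS:", within / norm_sq)
print("cross-layer share of NDS: ", cross / norm_sq)
```

The two shares sum to the full NDS, so tracking them over training shows which part of the curvature a given optimizer's updates are exposed to.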

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Muon paper introduces NDS to separate the curvature penalty from the update norm and proves a balancing result on heterogeneous quadratics, but the Taylor step that ties this to real LLM loss decrease is unverified.


The core contribution is the Normalized Directional Sharpness measure and the clean split of the second-order term into update norm versus directional sharpness. They show Muon and Adam have similar norms at matched validation loss, so the smaller curvature penalty for Muon comes from lower NDS. The controlled Zipf-PCFG experiments tie the gap to data imbalance, and the within-layer analysis points to middle-to-late training behavior. The stylized quadratic proof is self-contained: Muon balances energy across curvature groups and beats GD on average NDS when heterogeneity is high enough.

The main limitation is the opening move. The abstract treats the second-order Taylor approximation as directly giving the one-step loss decrease, then decomposes the penalty. In non-convex, high-dimensional training, that assumption is not checked against actual loss changes, so the geometric quantities do not yet have a verified link to observed performance. The quadratic result stands on its own but does not close that gap for the Adam comparison.

The work is aimed at people who already think about optimizer geometry and want a new explanatory variable. It is coherent on its own terms and the experiments are targeted rather than post-hoc, so it deserves a referee even if the Taylor bridge needs more evidence. I would send it out.

Referee Report

1 major / 2 minor

Summary. The paper claims that Muon outperforms Adam in LLM training by achieving a larger one-step loss decrease under a second-order Taylor approximation of the loss landscape. The advantage is attributed to a smaller curvature penalty arising from lower Normalized Directional Sharpness (NDS), not to update norms, which are comparable between the two optimizers. Empirical decompositions on real training runs and on Zipf-PCFG data with controlled imbalance show that data imbalance amplifies the NDS advantage, with within-layer curvature sustaining it in the middle and late stages. An analysis of stylized quadratic problems with heterogeneous curvature and gradient alignment proves that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups, which yields lower local quadratic loss when heterogeneity is strong.

Significance. If the Taylor approximation reliably captures one-step dynamics, the work supplies a geometric account of optimizer differences via NDS and curvature heterogeneity, with credit due to the exact self-contained quadratic proof and the controlled empirical decomposition separating NDS from update norm. The introduction of NDS and the data-imbalance experiments provide mechanistic insight beyond standard optimizer comparisons.

major comments (1)

Abstract (first analysis step): The central claim that Muon achieves larger one-step loss decrease than Adam (at matched validation loss) because of smaller curvature penalty rests on the second-order Taylor expansion accurately predicting the actual loss change. In high-dimensional non-convex LLM landscapes, higher-order terms, movement across curvature regions, or step-size effects may dominate; no verification comparing the quadratic prediction to observed loss decrease is provided, so the subsequent NDS decomposition does not yet reliably explain performance gaps.

minor comments (2)

The manuscript would benefit from an explicit early definition and formula for Normalized Directional Sharpness (NDS) before its use in the decomposition.
Figure captions for the within-/cross-layer NDS plots should state the exact training stages (e.g., step ranges) and number of runs averaged.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback on the reliability of the second-order Taylor approximation. We address the major comment below.


Referee: The central claim that Muon achieves larger one-step loss decrease than Adam (at matched validation loss) because of smaller curvature penalty rests on the second-order Taylor expansion accurately predicting the actual loss change. In high-dimensional non-convex LLM landscapes, higher-order terms, movement across curvature regions, or step-size effects may dominate; no verification comparing the quadratic prediction to observed loss decrease is provided, so the subsequent NDS decomposition does not yet reliably explain performance gaps.

Authors: We agree that an explicit verification comparing the quadratic prediction to the observed one-step loss decrease would strengthen the central claim. While the NDS decomposition itself is performed using gradients and curvature estimates from actual training runs (rather than purely from the approximation), we did not include a direct side-by-side comparison of predicted versus realized loss change. In the revised manuscript we will add this verification: on the LLM training trajectories we will compute the actual loss after a single optimizer step on held-out batches and report its correlation with the second-order Taylor prediction, along with the relative contribution of higher-order terms. This addition will clarify the regime in which the approximation remains informative and thereby support the subsequent geometric interpretation via NDS.

revision: yes
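
The kind of check discussed in this exchange, comparing the second-order Taylor prediction with the realized one-step loss change, can be sketched on a toy non-quadratic loss. Everything in the block is a hypothetical stand-in: the loss `sum(a * log(cosh(w)))`, the scales `a`, the starting point, and the step size are chosen only so the gradient and Hessian are available in closed form. A real verification would use the training loss and Hessian-vector products.

```python
import numpy as np

a = np.array([4.0, 1.0, 0.25])             # hypothetical per-coordinate scales

def loss(w):
    return float(np.sum(a * np.log(np.cosh(w))))

def grad(w):
    return a * np.tanh(w)

def hess_diag(w):
    return a / np.cosh(w) ** 2             # diagonal Hessian of the toy loss

w = np.array([1.0, -0.5, 0.25])
delta = -0.05 * grad(w)                    # one small gradient step

first_order = float(grad(w) @ delta)                        # <g, delta>
curvature_penalty = 0.5 * float(delta @ (hess_diag(w) * delta))
predicted = first_order + curvature_penalty                 # Taylor prediction
realized = loss(w + delta) - loss(w)                        # actual one-step change
rel_gap = abs(predicted - realized) / abs(realized)
print(f"predicted {predicted:.5f}, realized {realized:.5f}, relative gap {rel_gap:.2%}")
```

On this toy problem the gap is small because the step is short; the referee's concern is that the same need not hold along an LLM training trajectory, which is why the relative gap would have to be reported there.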

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external assumptions


The paper's central chain begins with a second-order Taylor expansion applied to the loss (abstract, first analysis step), decomposes the curvature penalty into update norm and NDS (defined quantities), and then proves a comparison result for stylized quadratic problems with heterogeneous curvature. The quadratic proof is an exact derivation within its stated model and does not reduce to a fitted parameter or self-citation; NDS is computed from the Taylor term rather than reverse-engineered from performance gaps. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided derivation steps. The Taylor approximation's applicability to real LLM landscapes is a separate correctness question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The analysis rests on the Taylor approximation as a domain assumption and introduces NDS without external validation; no free parameters are explicitly fitted in the abstract description.

axioms (1)

Domain assumption: a second-order Taylor approximation suffices to compare one-step loss decreases between optimizers.
Invoked to establish Muon's larger loss decrease at matched validation loss.

invented entities (1)

Normalized Directional Sharpness (NDS) · no independent evidence
purpose: Decompose curvature penalty into directional alignment component separate from update norm.
New quantity defined to explain Muon's advantage; no independent evidence outside the paper's derivations.

pith-pipeline@v0.9.1-grok · 5811 in / 1360 out tokens · 33734 ms · 2026-06-28T07:02:08.964393+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Muon as a Residual Connection
cs.LG · 2026-07 · unverdicted · novelty 3.0

Muon is interpreted as an implicit residual connection that sacrifices local gradient fidelity to improve downstream layer usability in neural network training.

Reference graph

Works this paper leans on

146 extracted references · 78 canonical work pages · cited by 1 Pith paper · 25 internal anchors

[1]

arXiv preprint arXiv:2410.06205 , year=

Round and Round We Go! What makes Rotary Positional Encodings useful? , author=. arXiv preprint arXiv:2410.06205 , year=

work page arXiv
[10]

2024 , url =

Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and @fernbear.bsky.social and Boza Vlado and You Jiacheng and Franz Cesista and Braden Koszarsky and @Grad62304977 , title =. 2024 , url =

2024
[11]

Advances in Neural Information Processing Systems , volume=

The fineweb datasets: Decanting the web for the finest text data at scale , author=. Advances in Neural Information Processing Systems , volume=
[12]

, author=

Neural networks and physical systems with emergent collective computational abilities. , author=. Proceedings of the national academy of sciences , volume=
[13]

IEEE transactions on computers , volume=

Correlation matrix memories , author=. IEEE transactions on computers , volume=. 2009 , publisher=

2009
[14]

International Conference on Machine Learning , pages=

Resurrecting recurrent neural networks for long sequences , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[15]

European Conference on Computer Vision , pages=

Motion mamba: Efficient and long sequence motion generation , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[16]

Nature , volume=

Non-holographic associative memory , author=. Nature , volume=. 1969 , publisher=

1969
[17]

URL https://kellerjordan

Muon: An optimizer for hidden layers in neural networks, 2024 , author=. URL https://kellerjordan. github. io/posts/muon , volume=

2024
[21]

High-dimensional Learning Dynamics 2025 , year=

On Generalization of Spectral Gradient Descent: A Case Study on Imbalanced Data , author=. High-dimensional Learning Dynamics 2025 , year=

2025
[23]

arXiv e-prints , pages=

A note on the convergence of muon and further , author=. arXiv e-prints , pages=
[27]

On the Convergence Analysis of Muon

On the convergence analysis of muon , author=. arXiv preprint arXiv:2505.23737 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

International Conference on Learning Representations , year=

Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=
[33]

Advances in neural information processing systems , volume=

Symbolic discovery of optimization algorithms , author=. Advances in neural information processing systems , volume=
[34]

Advances in Neural Information Processing Systems , volume=

Convergence of adam under relaxed assumptions , author=. Advances in Neural Information Processing Systems , volume=
[35]

Advances in neural information processing systems , volume=

Adam can converge without any modification on update rules , author=. Advances in neural information processing systems , volume=
[36]

Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition , pages=

A sufficient condition for convergences of adam and rmsprop , author=. Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition , pages=
[39]

Advances in neural information processing systems , volume=

Why transformers need adam: A hessian perspective , author=. Advances in neural information processing systems , volume=
[40]

, author=

Adaptive subgradient methods for online learning and stochastic optimization. , author=. Journal of machine learning research , volume=
[42]

Zero-Shot Relation Extraction via Reading Comprehension

Zero-shot relation extraction via reading comprehension , author=. arXiv preprint arXiv:1706.04115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Mathematics , volume=

Survey of optimization algorithms in modern neural networks , author=. Mathematics , volume=. 2023 , publisher=

2023
[44]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=
[45]

Advances in Neural Information Processing Systems , volume=

High-dimensional asymptotics of feature learning: How one gradient step improves the representation , author=. Advances in Neural Information Processing Systems , volume=
[46]

arXiv preprint arXiv:2305.18270 , year=

How two-layer neural networks learn, one (giant) step at a time , author=. arXiv preprint arXiv:2305.18270 , year=

work page arXiv
[47]

arXiv preprint arXiv:2410.02355 , year=

Alphaedit: Null-space constrained knowledge editing for language models , author=. arXiv preprint arXiv:2410.02355 , year=

work page arXiv
[48]

Advances in Neural Information Processing Systems , volume=

Heavy-tailed class imbalance and why adam outperforms gradient descent on language models , author=. Advances in Neural Information Processing Systems , volume=
[49]

Transformer Feed-Forward Layers Are Key-Value Memories , year =

Transformer feed-forward layers are key-value memories , author=. arXiv preprint arXiv:2012.14913 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2012
[50]

Mass-Editing Memory in a Transformer

Mass-editing memory in a transformer , author=. arXiv preprint arXiv:2210.07229 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Advances in neural information processing systems , volume=

What can transformers learn in-context? a case study of simple function classes , author=. Advances in neural information processing systems , volume=
[52]

arXiv preprint arXiv:2104.08696 , year=

Knowledge neurons in pretrained transformers , author=. arXiv preprint arXiv:2104.08696 , year=

work page arXiv
[53]

Advances in neural information processing systems , volume=

Locating and editing factual associations in gpt , author=. Advances in neural information processing systems , volume=
[54]

Organization of memory , volume=

Episodic and semantic memory , author=. Organization of memory , volume=. 1972 , publisher=

1972
[55]

arXiv preprint arXiv:2412.06538 , year=

Understanding factual recall in transformers via associative memories , author=. arXiv preprint arXiv:2412.06538 , year=

work page arXiv
[56]

arXiv preprint arXiv:2310.17813 , year=

A spectral condition for feature learning , author=. arXiv preprint arXiv:2310.17813 , year=

work page arXiv
[58]

arXiv preprint arXiv:2410.11474 , year=

How transformers implement induction heads: Approximation and optimization analysis , author=. arXiv preprint arXiv:2410.11474 , year=

work page arXiv
[59]

Advances in Neural Information Processing Systems , volume=

Birth of a transformer: A memory viewpoint , author=. Advances in Neural Information Processing Systems , volume=
[60]

Advances in Neural Information Processing Systems , volume=

The evolution of statistical induction heads: In-context learning markov chains , author=. Advances in Neural Information Processing Systems , volume=
[61]

arXiv preprint arXiv:2409.10559 , year=

Unveiling induction heads: Provable training dynamics and feature learning in transformers , author=. arXiv preprint arXiv:2409.10559 , year=

work page arXiv
[62]

arXiv preprint arXiv:2402.14735 , year=

How transformers learn causal structure with gradient descent , author=. arXiv preprint arXiv:2402.14735 , year=

work page arXiv
[63]

International Conference on Learning Representations , year=

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization , author=. International Conference on Learning Representations , year=
[64]

Gemma 3 Technical Report

Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[65]

arXiv preprint arXiv:2404.05405 , year=

Physics of language models: Part 3.3, knowledge capacity scaling laws , author=. arXiv preprint arXiv:2404.05405 , year=

work page arXiv
[66]

Proceedings of the National Academy of Sciences , volume=

Singular value decomposition for genome-wide expression data processing and modeling , author=. Proceedings of the National Academy of Sciences , volume=. 2000 , publisher=

2000
[67]

2007 15th European signal processing conference , pages=

The effective rank: A measure of effective dimensionality , author=. 2007 15th European signal processing conference , pages=. 2007 , organization=

2007
[68]

Kimi K2: Open Agentic Intelligence

Kimi k2: Open agentic intelligence , author=. arXiv preprint arXiv:2507.20534 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[70]

arXiv preprint arXiv:2407.07972 , year=

Deconstructing what makes a good optimizer for language models , author=. arXiv preprint arXiv:2407.07972 , year=

work page arXiv
[72]

arXiv preprint arXiv:2408.09632 , year=

Modegpt: Modular decomposition for large language model compression , author=. arXiv preprint arXiv:2408.09632 , year=

work page arXiv
[77]

IEE proceedings F (radar and signal processing) , volume=

Blind beamforming for non-Gaussian signals , author=. IEE proceedings F (radar and signal processing) , volume=. 1993 , organization=

1993
[78]

SIAM journal on matrix analysis and applications , volume=

Jacobi angles for simultaneous diagonalization , author=. SIAM journal on matrix analysis and applications , volume=. 1996 , publisher=

1996
[79]

International Conference on Machine Learning , pages=

Sharp minima can generalize for deep nets , author=. International Conference on Machine Learning , pages=. 2017 , organization=

2017
[83]

International conference on machine learning , pages=

Optimizing neural networks with kronecker-factored approximate curvature , author=. International conference on machine learning , pages=. 2015 , organization=

2015
[84]

Advances in neural information processing systems , volume=

Fast approximate natural gradient descent in a kronecker factored eigenbasis , author=. Advances in neural information processing systems , volume=
[85]

arXiv preprint arXiv:2406.17748 , year=

A New Perspective on Shampoo's Preconditioner , author=. arXiv preprint arXiv:2406.17748 , year=

work page arXiv
[86]

International Conference on Machine Learning , pages=

Shampoo: Preconditioned stochastic tensor optimization , author=. International Conference on Machine Learning , pages=. 2018 , organization=

2018
[99]

International conference on machine learning , pages=

Adafactor: Adaptive learning rates with sublinear memory cost , author=. International conference on machine learning , pages=. 2018 , organization=

2018
[101]

Advances in neural information processing systems , volume=

Adabelief optimizer: Adapting stepsizes by the belief in observed gradients , author=. Advances in neural information processing systems , volume=
[102]

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Large batch optimization for deep learning: Training bert in 76 minutes , author=. arXiv preprint arXiv:1904.00962 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904
[103]

arXiv preprint arXiv:2305.14342 , year=

Sophia: A scalable stochastic second-order optimizer for language model pre-training , author=. arXiv preprint arXiv:2305.14342 , year=

work page arXiv
[104]

Advances in neural information processing systems , volume=

Visualizing the loss landscape of neural nets , author=. Advances in neural information processing systems , volume=
[107]

International conference on machine learning , pages=

Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[109]

Advances in Neural Information Processing Systems , volume=

Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization , author=. Advances in Neural Information Processing Systems , volume=
[111]

Linear algebra for large scale and real-time applications , pages=

Approximation with Kronecker products , author=. Linear algebra for large scale and real-time applications , pages=. 1993 , publisher=

1993
[115]

Advances in Neural Information Processing Systems , volume=

Sharpness-aware training for free , author=. Advances in Neural Information Processing Systems , volume=
[121]

Ahn, K. , Xu, B. , Abreu, N. , Fan, Y. , Magakyan, G. , Sharma, P. , Zhan, Z. and Langford, J. (2025). Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295

work page arXiv 2025
[122]

, Liu, Y

An, K. , Liu, Y. , Pan, R. , Ren, Y. , Ma, S. , Goldfarb, D. and Zhang, T. (2025). Asgo: Adaptive structured gradient optimization. arXiv preprint arXiv:2503.20762

work page arXiv 2025
[123]

, Croce, F

Andriushchenko, M. , Croce, F. , M \"u ller, M. , Hein, M. and Flammarion, N. (2023). A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011

work page arXiv 2023
[124]

arXiv preprint arXiv:2002.09018 , year=

Anil, R. , Gupta, V. , Koren, T. , Regan, K. and Singer, Y. (2020). Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018

work page arXiv 2020
[125]

and Newhouse, L

Bernstein, J. and Newhouse, L. (2024 a ). Modular duality in deep learning. arXiv preprint arXiv:2410.21265

work page arXiv 2024
[126]

Old Optimizer, New Norm: An Anthology

Bernstein, J. and Newhouse, L. (2024 b ). Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325

work page internal anchor Pith review Pith/arXiv arXiv 2024
[127]

, Massena, T

Boissin, T. , Massena, T. , Mamalet, F. and Serrurier, M. (2025). Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning. arXiv preprint arXiv:2512.04632

work page arXiv 2025
[128]

and Souloumiac, A

Cardoso, J.-F. and Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. In IEE proceedings F (radar and signal processing), vol. 140. IET

1993
[129]

and Souloumiac, A

Cardoso, J.-F. and Souloumiac, A. (1996). Jacobi angles for simultaneous diagonalization. SIAM journal on matrix analysis and applications, 17 161--164

1996
[130]

Chen, L. , Li, J. and Liu, Q. (2025). Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054

work page arXiv 2025
[131]

, Liu, S

Chen, X. , Liu, S. , Sun, R. and Hong, M. (2019). On the convergence of a class of adam-type algorithms for non-convex optimization. In International Conference on Learning Representations. ://openreview.net/forum?id=H1x-x309tm

2019
[132]

, Zang, J

Cheng, P. , Zang, J. , Li, Q. , Ma, L. , Cui, Y. , Zhang, Y. , Chen, B. , Jian, M. and Tong, W. (2026). Trasmuon: Trust-region adaptive scaling for orthogonalized momentum optimizers. arXiv preprint arXiv:2602.13498

work page arXiv 2026
[133]

Cohen, J. M. , Kaur, S. , Li, Y. , Kolter, J. Z. and Talwalkar, A. (2021). Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065

work page arXiv 2021
[134]

Défossez, A., Bottou, L., Bach, F. and Usunier, N. (2020). A simple convergence proof of Adam and Adagrad. arXiv preprint arXiv:2003.02395

[135]

Dinh, L., Pascanu, R., Bengio, S. and Bengio, Y. (2017). Sharp minima can generalize for deep nets. In International Conference on Machine Learning. PMLR

[136]

Dong, Z., Zhang, Y., Yao, J. and Sun, R. (2025). Towards quantifying the Hessian structure of neural networks. arXiv preprint arXiv:2505.02809

[137]

Du, J., Yan, H., Feng, J., Zhou, J. T., Zhen, L., Goh, R. S. M. and Tan, V. Y. (2021). Efficient sharpness-aware minimization for improved training of neural networks. arXiv preprint arXiv:2110.03141
