Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

Aleksandr Beznosikov; Aleksandr Bogdanov; Andrey Veprikov; Arman Bolatov; Martin Takáč; Samuel Horváth; Slavomir Hanzely

arxiv: 2510.10777 · v3 · pith:ZW5XEE23new · submitted 2025-10-12 · 💻 cs.LG · math.OC

Pith reviewed 2026-05-21 21:01 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords: optimization, preconditioned norms, unified framework, adaptive methods, quasi-Newton, steepest descent, invariance

The pith

Preconditioned matrix norms provide a single framework in which steepest descent, quasi-Newton, and adaptive optimizers all appear as special cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a new abstraction called preconditioned matrix norms lets optimizers adapt to problem geometry through norm choices while also incorporating curvature information. This single principle recovers SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as instances rather than separate inventions. The framework supplies necessary and sufficient conditions for affine and scale invariance when parameters are matrices. Two new hybrids, MuAdam and MuAdam-SANIA, are derived by mixing spectral geometry with Adam-style preconditioning and are shown to compete with current methods on standard tasks.

Core claim

Preconditioned matrix norms generalize steepest descent by allowing arbitrary norm choices that adapt to different geometries, extend quasi-Newton and adaptive methods beyond the Frobenius inner product, and establish that SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus emerge directly as special cases of the same construction. Necessary and sufficient conditions for affine and scale invariance are derived under these generalized norms.

What carries the argument

Preconditioned matrix norms, which augment standard matrix norms with a preconditioning operator to encode both geometric adaptation and curvature utilization in a single object.
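
To make this object concrete, here is a minimal formal sketch. It assumes (the definition is not spelled out on this page) that a preconditioned norm takes the form $\|X\|_{L,R} := \|L X R\|$ for invertible matrices $L$, $R$ and a base matrix norm $\|\cdot\|$, and that $\mathrm{lmo}$ denotes the linear minimization oracle over the unit ball. Under that assumption, the identity quoted as Theorem 1 in the Lean-theorems section of this page follows from a change of variables $Y = L X R$:

$$
\mathrm{lmo}_{\|\cdot\|}(G) := \arg\min_{\|X\| \le 1} \langle G, X \rangle,
\qquad
\langle G,\, L^{-1} Y R^{-1} \rangle = \langle L^{-\top} G R^{-\top},\, Y \rangle,
$$

$$
\mathrm{lmo}_{L,R,\|\cdot\|}(G)
= \arg\min_{\|L X R\| \le 1} \langle G, X \rangle
= L^{-1}\, \mathrm{lmo}_{\|\cdot\|}\!\left(L^{-\top} G R^{-\top}\right) R^{-1}.
$$

With $L = R = I$ this reduces to ordinary steepest descent under $\|\cdot\|$. With the Frobenius base norm, the direction is proportional to $-(L^\top L)^{-1} G\, (R R^\top)^{-1}$, that is, a two-sided preconditioned gradient.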

If this is right

Existing optimizers can be re-derived and compared inside one formalism instead of being developed in isolation.
Hybrid methods such as MuAdam arise systematically by selecting different combinations of norm and preconditioner.
Invariance properties for matrix-valued parameters can be checked or enforced by verifying the stated necessary and sufficient conditions.
New optimizers can be constructed by exploring other preconditioned norms that have not yet been instantiated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unification may allow automatic selection or interpolation between norms based on observed curvature or architecture type.
Similar preconditioned-norm constructions could be carried over to Riemannian or manifold-constrained optimization settings.
Convergence rates for the new MuAdam variants could be derived by specializing existing analyses of steepest descent under matrix norms.

Load-bearing premise

The chosen abstraction of preconditioned matrix norms is assumed to capture the essential geometry and curvature of the listed optimizers, rather than the unification holding merely by virtue of how the norms are defined.

What would settle it

An explicit derivation showing that Adam or Muon cannot be recovered from any choice of preconditioned matrix norm without extra structure that lies outside the framework.

Figures

Figures reproduced from arXiv: 2510.10777 by Aleksandr Beznosikov, Aleksandr Bogdanov, Andrey Veprikov, Arman Bolatov, Martin Takáč, Samuel Horváth, Slavomir Hanzely.

**Figure 2.** LLM fine-tuning results on Qwen2-7B: mean final accuracy with standard deviation across three seeds.

Original abstract

Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames many optimizers as special cases of minimizing a preconditioned matrix norm, which yields clean invariance conditions and two new hybrids, though the unification risks being partly by construction.

The main takeaway is that this work treats steepest descent, quasi-Newton, and adaptive methods as instances of minimizing a preconditioned matrix norm ||ΔW||_P. Picking the right P recovers SGD (identity), Adam (second-moment diagonal), Muon (spectral), KL-Shampoo, SOAP, and SPlus as special cases. They also derive necessary and sufficient conditions for affine and scale invariance under these generalized norms, which appears to be the first systematic treatment for matrix parameters. From there they define MuAdam and MuAdam-SANIA, which blend Muon-style spectral geometry with Adam-style preconditioning, and report that the new variants are competitive with or beat existing methods on the tested tasks. Code is released, which is helpful for checking the claims directly.

The invariance analysis is the clearest addition; it gives concrete conditions that could actually constrain future optimizer design rather than just catalog existing ones. The unification itself organizes the landscape neatly and makes the geometric trade-offs more explicit.

The soft spot is that the framework can feel definitional. If the preconditioner is chosen precisely to reproduce each method's update, then the equivalence follows from the setup instead of revealing a deeper shared principle. The stress-test concern lands here: without seeing the derivations, it is not obvious whether the new methods arise from the norm principle on independent grounds or from post-hoc fitting. Experiments would also benefit from tighter controls on hyperparameter budgets and more varied architectures to show the hybrids are not just lucky on the reported runs.

This is for researchers who design or analyze training algorithms, especially those interested in invariance or geometric views of curvature. A reader who wants a compact way to think about why certain preconditioners work across methods will get value.

The paper shows clear thinking and honest engagement with the literature, so it deserves a serious referee even if the unification needs sharpening.
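
The "Muon (spectral)" case mentioned in the letter can be made concrete with a small sketch. Under the spectral norm, the steepest-descent direction for a gradient G = UΣVᵀ is -UVᵀ, the negated polar factor. The code below approximates it with a plain cubic Newton-Schulz iteration in pure Python. The helper names (`matmul`, `transpose`, `orthogonalize`, `spectral_lmo`) and the choice of iteration are illustrative assumptions, not the paper's or Muon's actual implementation, which may use tuned polynomial coefficients and momentum.

```python
# Illustrative sketch only: spectral-norm steepest-descent direction in pure Python.
# All helper names are hypothetical; this is not the paper's or Muon's code.

def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (lists of lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate the polar factor U V^T of a full-rank g = U S V^T.

    Uses the cubic Newton-Schulz map X <- 1.5 X - 0.5 X X^T X, which pushes
    every singular value in (0, 1] toward 1 while keeping U and V fixed.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]  # singular values now in (0, 1]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

def spectral_lmo(g):
    """Direction minimizing <g, d> over the spectral-norm unit ball: -U V^T."""
    return [[-v for v in row] for row in orthogonalize(g)]
```

For a symmetric positive definite gradient the polar factor is the identity, so `spectral_lmo` should return approximately the negated identity, which gives a quick sanity check. The iteration count is fixed for simplicity and would be too small for gradients with extremely small singular values.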

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a unified optimization framework based on preconditioned matrix norms that generalizes steepest descent (via norm choice), quasi-Newton methods, and adaptive methods (via curvature incorporation). It claims that SGD (P = I), Adam (second-moment diagonal preconditioning), Muon (spectral norm), KL-Shampoo, SOAP, and SPlus all arise as special cases. The work derives necessary and sufficient conditions for affine and scale invariance under these generalized norms and proposes two new hybrids, MuAdam and MuAdam-SANIA, which combine Muon's spectral geometry with Adam-style preconditioning; experiments indicate these are competitive with or superior to existing methods on standard benchmarks. Code is provided for reproducibility.

Significance. A non-tautological unification that independently recovers existing update rules from a shared geometric principle, together with explicit invariance conditions and new competitive hybrids, would constitute a useful organizing framework for optimizer design. The systematic invariance analysis and experimental validation of the proposed MuAdam variants are potentially valuable contributions if the derivations hold without post-hoc embedding of each method's preconditioner.

major comments (2)

[§3] §3 (Definition of preconditioned matrix norms and recovery of existing methods): The central unification claim requires that minimizing ||ΔW||_P independently yields the exact update rules of SGD, Adam, Muon, etc. The manuscript should explicitly demonstrate that the choice of P for each optimizer is derived from geometric or curvature considerations rather than reverse-engineered to match the known update; otherwise the equivalence risks being definitional. A concrete example showing the norm minimization step for at least Adam and Muon, with the resulting closed-form update, would clarify this.
[§4] §4 (Invariance conditions): The necessary and sufficient conditions for affine and scale invariance are presented under generalized norms. It is unclear whether these conditions are satisfied by the specific P choices that recover the listed optimizers (e.g., Adam's second-moment estimate or Muon's spectral projection), or whether additional restrictions are imposed that limit the framework's applicability. A table or proposition verifying invariance for each recovered method would strengthen the claim.

minor comments (2)

[Abstract / §1] The abstract and introduction should more clearly distinguish the novel contribution (preconditioned norms as a unifying principle) from the known fact that many optimizers can be viewed as preconditioned gradient steps.
[Experiments] Experimental section: baseline comparisons should include recent hybrids such as SOAP and SPlus with identical hyperparameter tuning protocols to ensure the competitiveness claim for MuAdam variants is not due to tuning differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. These have helped us clarify the presentation of the preconditioned norms framework and its connections to existing methods. We address each major comment point by point below, with revisions made to strengthen the derivations and invariance analysis.

Referee: [§3] §3 (Definition of preconditioned matrix norms and recovery of existing methods): The central unification claim requires that minimizing ||ΔW||_P independently yields the exact update rules of SGD, Adam, Muon, etc. The manuscript should explicitly demonstrate that the choice of P for each optimizer is derived from geometric or curvature considerations rather than reverse-engineered to match the known update; otherwise the equivalence risks being definitional. A concrete example showing the norm minimization step for at least Adam and Muon, with the resulting closed-form update, would clarify this.

Authors: We agree that explicit derivation of each P from geometric principles is essential to substantiate the unification. In the revised manuscript, we have added a dedicated subsection (3.3) that derives the preconditioner choices from first principles. For Muon, the spectral norm is the operator norm induced by the Euclidean vector norm, so that the steepest-descent step under ||ΔW||_P is proportional to -UV^T, where G = UΣV^T is the singular value decomposition of the gradient (that is, the orthogonalized gradient, scaled by the step size), recovering the Muon rule without post-hoc fitting. For Adam, the diagonal preconditioner P is motivated as a curvature approximation via the second-moment estimate of the gradient, which corresponds to a diagonal Hessian approximation; the closed-form minimizer of ||ΔW||_P is then the element-wise scaled update, matching Adam exactly. These derivations are presented with the full minimization steps and resulting closed forms for both methods, showing that the P selections follow directly from the desired geometry or curvature model rather than being reverse-engineered.

Revision: yes

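
The Adam case described in this response can be sketched for a flat parameter vector. The sketch assumes one particular reading: the norm is ||x||_D = ||D x||_2 with D = diag(v^(1/4)), where v is a second-moment estimate. The change of variables y = D x then gives a direction proportional to -g / sqrt(v). The function names are hypothetical, and momentum, bias correction, and the epsilon term of Adam are deliberately left out.

```python
import math

# Illustrative sketch only: Adam-style direction from a diagonally
# preconditioned Euclidean norm. Names and the choice D = diag(v ** 0.25)
# are assumptions for illustration, not taken from the paper.

def euclidean_lmo(g):
    """argmin over ||x||_2 <= 1 of <g, x>: the negated, normalized gradient."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [-gi / norm for gi in g]

def diag_preconditioned_lmo(g, d):
    """LMO for ||x||_D = ||D x||_2 with D = diag(d), all d > 0.

    Substituting y = D x gives x = D^{-1} * euclidean_lmo(D^{-1} g).
    """
    scaled = [gi / di for gi, di in zip(g, d)]
    y = euclidean_lmo(scaled)
    return [yi / di for yi, di in zip(y, d)]

def adam_style_direction(g, v):
    """Direction proportional to -g / sqrt(v), obtained with d = v ** 0.25."""
    d = [vi ** 0.25 for vi in v]
    return diag_preconditioned_lmo(g, d)
```

Because the oracle normalizes its output, the result matches -g / sqrt(v) only up to a positive scalar, which a step size would absorb.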
Referee: [§4] §4 (Invariance conditions): The necessary and sufficient conditions for affine and scale invariance are presented under generalized norms. It is unclear whether these conditions are satisfied by the specific P choices that recover the listed optimizers (e.g., Adam's second-moment estimate or Muon's spectral projection), or whether additional restrictions are imposed that limit the framework's applicability. A table or proposition verifying invariance for each recovered method would strengthen the claim.

Authors: We thank the referee for highlighting this verification gap. In the revision, we have inserted a new table (Table 1) in §4 that enumerates each recovered optimizer (SGD, Adam, Muon, KL-Shampoo, SOAP, SPlus, and the proposed MuAdam variants), specifies the corresponding P, and indicates satisfaction of the affine and scale invariance conditions from Propositions 4.1 and 4.2. We also add a short corollary proving that the listed P choices satisfy the necessary and sufficient conditions under the problem assumptions already stated in the paper (e.g., Adam satisfies scale invariance but not full affine invariance, while Muon's spectral norm satisfies both when the matrix dimensions permit). No additional restrictions beyond those in the original framework are required, confirming broad applicability.

Revision: yes
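
What an invariance check of this kind tests can be shown on a toy problem. The sketch below assumes the convention that an optimizer is scale invariant when running it on the reparameterized loss Φ(u) = f(a ⊙ u), for a positive per-coordinate scaling a, produces iterates that map back to the original ones via w = a ⊙ u. The separable quadratic, the function names, and the use of an exact diagonal Hessian are illustrative assumptions and are not the paper's experimental setup.

```python
# Illustrative sketch only: checking scale invariance on a separable quadratic
# f(w) = sum_i (0.5 * h_i * w_i**2 - b_i * w_i). All names are hypothetical.

def grad(w, h, b):
    return [hi * wi - bi for hi, wi, bi in zip(h, w, b)]

def gd_step(w, h, b, lr):
    """Plain gradient descent step (no preconditioning)."""
    return [wi - lr * gi for wi, gi in zip(w, grad(w, h, b))]

def diag_newton_step(w, h, b):
    """Step preconditioned by the exact diagonal Hessian h."""
    return [wi - gi / hi for wi, gi, hi in zip(w, grad(w, h, b), h)]

def reparameterize(h, b, a):
    """Coefficients of Phi(u) = f(a * u): curvature h * a**2, linear term b * a."""
    return ([hi * ai * ai for hi, ai in zip(h, a)],
            [bi * ai for bi, ai in zip(b, a)])

def map_back(u, a):
    """Map reparameterized iterates to the original coordinates, w = a * u."""
    return [ai * ui for ai, ui in zip(a, u)]
```

Usage: take one step from w0 on f and one step from u0 = w0 / a on Φ, then compare `map_back(u1, a)` with `w1`. The diagonally preconditioned step agrees in both coordinate systems, while the plain gradient step does not unless every entry of a is 1.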

Circularity Check

1 step flagged

Unification holds by embedding existing preconditioners into the norm definition rather than deriving them independently

specific steps

self-definitional [Abstract]
"we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle."

The preconditioned-norm abstraction is introduced precisely so that each listed optimizer corresponds to a particular choice of the preconditioner matrix P inside the norm; the update rule for that optimizer is then recovered by construction when the minimization is performed with that P. Equivalence therefore follows from the definitional setup rather than from an a-priori geometric principle that would have predicted the preconditioners without prior knowledge of the optimizers.

full rationale

The paper defines a general steepest-descent update via minimization of a preconditioned matrix norm ||ΔW||_P and then shows that SGD, Adam, Muon, etc. arise for particular choices of P (identity, second-moment diagonal, spectral norm, etc.). Because the specific P for each optimizer is selected precisely to reproduce that optimizer's known update rule, the claimed 'emergence as special cases' reduces to a definitional reparameterization rather than an independent geometric derivation. The new methods MuAdam and MuAdam-SANIA are genuine extensions, but the central unification claim for the listed existing methods is load-bearing on this construction. No self-citation chain or uniqueness theorem is invoked to force the result, so the circularity is partial (score 6) rather than total.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete; the central unification rests on the new norm definition and the claim that listed methods are special cases under it.

axioms (1)

domain assumption: Preconditioned matrix norms form a valid and sufficiently general abstraction for the listed optimization families
Invoked to establish that SGD, Adam, Muon and others emerge as special cases

invented entities (1)

preconditioned matrix norms (no independent evidence)
purpose: To provide a single mathematical object that recovers steepest descent, quasi-Newton, and adaptive methods as instances
New concept introduced to unify the methods

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms... SGD and Adam... Muon and KL-Shampoo... SOAP and SPlus... all emerge as special cases

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

Theorem 1: $\mathrm{lmo}_{L,R,\|\cdot\|}(G) = L^{-1}\,\mathrm{lmo}_{\|\cdot\|}(L^{-\top} G R^{-\top})\,R^{-1}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC · 2026-05 · conditional · novelty 7.0

Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

LionMuon: Alternating Spectral and Sign Descent for Efficient Training
cs.LG · 2026-05 · unverdicted · novelty 6.0

LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...

[1] [1]

Sania: Polyak-type optimization framework leads to scale invariant stochastic algorithms.arXiv preprint arXiv:2312.17369,

Farshed Abdukhakimov, Chulu Xiang, Dmitry Kamzolov, Robert Gower, and Martin Tak´ aˇ c. Sania: Polyak-type optimization framework leads to scale invariant stochastic algorithms.arXiv preprint arXiv:2312.17369,

work page arXiv

[2] [2]

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

Noah Amsel, David Persson, Christopher Musco, and Robert Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm.ArXiv, abs/2505.16932,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

and Newhouse, L

Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024a. Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024b. Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex ...

work page arXiv

[4] [4]

Large-scale machine learning with stochastic gradient descent

L´ eon Bottou. Large-scale machine learning with stochastic gradient descent. InProceedings of COMPSTAT’2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers, pages 177–186. Springer,

work page 2010

[5] [5]

Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901,

12 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901,

work page 1901

[6] [6]

Understanding pre-training and fine-tuning from loss landscape perspectives.arXiv preprint arXiv:2505.17646,

Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, and Jun Zhu. Understanding pre-training and fine-tuning from loss landscape perspectives.arXiv preprint arXiv:2505.17646,

work page arXiv

[7] [7]

An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827,

Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert Gower. An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827,

work page arXiv

[8] [8]

A stable whitening optimizer for efficient neural network training

Kevin Frans, Sergey Levine, and Pieter Abbeel. A stable whitening optimizer for efficient neural network training. arXiv preprint arXiv:2506.07254,

work page arXiv

[9] [9]

Gradient methods with online scaling.arXiv preprint arXiv:2411.01803,

Wenzhi Gao, Ya-Chi Chu, Yinyu Ye, and Madeleine Udell. Gradient methods with online scaling.arXiv preprint arXiv:2411.01803,

work page arXiv

[10] Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, et al. Apertus: Democratizing open and compliant llms for global language environments. arXiv preprint arXiv:2509.14233,

URL https://kellerjordan.github.io/posts/muon/.

[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

[12] Jiongcheng Li. Quasi-newton method of optimization is proved to be a steepest descent method under the ellipsoid norm. arXiv preprint arXiv:2411.11286,

[13] Wu Lin, Scott Lowe, Felix Dangel, Runa Eschenhagen, Zikun Xu, and Roger Grosse. Understanding and improving the shampoo optimizer via kullback-leibler minimization. arXiv preprint arXiv:2509.03378,

[14] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529,

[15] Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making muon & scion great again! (bridging theory and practice of lmo-based optimizers for llms). arXiv preprint arXiv:2505.13416,

[16] Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440,

[17] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

[18] Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, and Dongsheng Li. A survey on memory-efficient large-scale model training in ai for science. arXiv preprint arXiv:2501.11847,

[19] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321,

[20] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, ... Association for Computational Linguistics. doi: 10.18653/v1/W18-5446.

[21] Rachel Ward. Stochastic gradient descent: where optimization meets machine learning. In Proc. Int. Cong. Math, volume 7, pages 5140–5153,

[22] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046,

[23] Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, and Zhiyuan Li. Structured preconditioners in adaptive optimization: A unified analysis. arXiv preprint arXiv:2503.10537,

[24] Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit Dhillon, Cho-Jui Hsieh, and Sanjiv Kumar. LoRA done RITE: Robust invariant transformation equilibration for LoRA optimization. arXiv preprint arXiv:2410.20625,

Association for Computational Linguistics. doi: 10.18653/v1/P19-1472.

[25] Thomas Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, and Nikolai Matni. On the concurrence of layer-wise preconditioning methods and provable feature learning. arXiv preprint arXiv:2502.01763,

[26] Throughout, we fix an invertible matrix $A \in \mathbb{R}^{d \times d}$ and consider the re-parameterized loss for vectorized parameters $\Phi(w^A) := L(A w^A)$, where $w^A := A^{-1} w$. Here we write $\Phi$ instead of $L_{\text{new}}$ (as in Section 3) for convenience; we make a similar change of notation in Appendix B. If an optimizer produces iterates $w_t$ for $L$, we denote by $w^A_t$ the iterates it produces on $\Phi$ ...
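As a rough illustration of this re-parameterization, the sketch below checks numerically whether an optimizer's iterates on $\Phi$ map back onto its iterates on $L$, that is, whether $A w^A_t = w_t$. The quadratic loss, the matrices `H` and `A`, the step size, and all helper names are assumptions chosen for this example and are not taken from the paper. The equality $A w^A_t = w_t$ is the usual meaning of affine invariance and is assumed here to match the paper's definition.

```python
# Toy check of affine invariance on a 2-D quadratic (pure Python, no dependencies).
# Assumed setup, not from the paper: L(w) = 0.5 * w^T H w - b^T w.

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(M):
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]

def inverse(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

H = [[3.0, 1.0], [1.0, 2.0]]   # Hessian of L (symmetric positive definite)
b = [1.0, -1.0]
A = [[2.0, 1.0], [0.0, 0.5]]   # invertible re-parameterization matrix
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def grad_L(w):
    Hw = matvec(H, w)
    return [Hw[0] - b[0], Hw[1] - b[1]]

def grad_Phi(v):
    # Chain rule: Phi(v) = L(A v)  =>  grad Phi(v) = A^T grad L(A v)
    return matvec(transpose(A), grad_L(matvec(A, v)))

H_Phi = matmul(matmul(transpose(A), H), A)   # Hessian of Phi is A^T H A

def run(grad, precond, x0, step, n_steps):
    """Iterate x <- x - step * precond @ grad(x) and return the last iterate."""
    x = list(x0)
    for _ in range(n_steps):
        d = matvec(precond, grad(x))
        x = [x[0] - step * d[0], x[1] - step * d[1]]
    return x

def invariance_gap(precond_L, precond_Phi, step=0.1, n_steps=5):
    """Max-abs difference between w_t and A w^A_t after n_steps."""
    w0 = [1.0, 1.0]
    v0 = matvec(inverse(A), w0)              # w^A_0 = A^{-1} w_0
    w = run(grad_L, precond_L, w0, step, n_steps)
    v = run(grad_Phi, precond_Phi, v0, step, n_steps)
    Av = matvec(A, v)
    return max(abs(w[0] - Av[0]), abs(w[1] - Av[1]))

gd_gap = invariance_gap(IDENTITY, IDENTITY)               # plain gradient descent
newton_gap = invariance_gap(inverse(H), inverse(H_Phi))   # damped Newton step
print(f"gradient descent gap: {gd_gap:.3e}")
print(f"damped Newton gap:    {newton_gap:.3e}")
```

On this toy problem, plain gradient descent leaves a visible gap between the two trajectories, while the damped Newton step, whose preconditioner transforms as $(A^\top H A)^{-1}$, reproduces the same trajectory up to floating-point error.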

[27] C Scale Invariance Setup and Hyperparameters

To ensure a fair comparison across optimizers and input scalings, we perform hyperparameter tuning separately for each method and for both the original and scaled tasks using Optuna on a held-out validation split (see Section 4.1). Tuned values are selected to maximize validation accuracy, and final results are reported on the test split with the chosen configuration.
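The tuning protocol described here can be sketched as follows. This is a minimal illustration only: it substitutes plain random search from the Python standard library for Optuna, and the method names, task names, search range, and toy accuracy function are all hypothetical placeholders rather than the paper's setup.

```python
import math
import random

# Hypothetical placeholder: log10 of the best learning rate per (method, task).
# A real run would train a model and measure accuracy instead.
TOY_OPTIMUM = {
    ("method_a", "original"): -3.0, ("method_a", "scaled"): -1.5,
    ("method_b", "original"): -2.5, ("method_b", "scaled"): -2.5,
}

def evaluate(method, task, cfg, split):
    """Toy accuracy surrogate; the test split is scored slightly lower."""
    dist = math.log10(cfg["lr"]) - TOY_OPTIMUM[(method, task)]
    offset = 0.0 if split == "val" else 0.01
    return 1.0 / (1.0 + dist ** 2) - offset

def tune(method, task, n_trials=50, seed=0):
    """Random search over lr in [1e-5, 1]; selection uses the validation split only."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("-inf")
    for _ in range(n_trials):
        cfg = {"lr": 10 ** rng.uniform(-5.0, 0.0)}
        val = evaluate(method, task, cfg, split="val")
        if val > best_val:
            best_cfg, best_val = cfg, val
    return best_cfg

results = {}
for (method, task) in TOY_OPTIMUM:
    cfg = tune(method, task)   # tuned separately for each (method, task) pair
    results[(method, task)] = {
        "lr": cfg["lr"],
        # The test split is touched once, after the configuration is fixed.
        "test_acc": evaluate(method, task, cfg, split="test"),
    }

for key, row in sorted(results.items()):
    print(key, f"lr={row['lr']:.2e}", f"test_acc={row['test_acc']:.3f}")
```

Tuning each pair independently matters in this toy setting because the best learning rate for `method_a` moves when the task is rescaled, so a configuration tuned only on the original task would put that method at a disadvantage on the scaled one.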