
arxiv: 2510.10777 · v3 · pith:ZW5XEE23new · submitted 2025-10-12 · 💻 cs.LG · math.OC

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

Pith reviewed 2026-05-21 21:01 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords: optimization, preconditioned norms, unified framework, adaptive methods, quasi-Newton, steepest descent, invariance

The pith

Preconditioned matrix norms provide a single framework in which steepest descent, quasi-Newton, and adaptive optimizers all appear as special cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a new abstraction called preconditioned matrix norms lets optimizers adapt to problem geometry through norm choices while also incorporating curvature information. This single principle recovers SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as instances rather than separate inventions. The framework supplies necessary and sufficient conditions for affine and scale invariance when parameters are matrices. Two new hybrids, MuAdam and MuAdam-SANIA, are derived by mixing spectral geometry with Adam-style preconditioning and are shown to compete with current methods on standard tasks.

Core claim

Preconditioned matrix norms generalize steepest descent by allowing arbitrary norm choices that adapt to different geometries, extend quasi-Newton and adaptive methods beyond the Frobenius inner product, and establish that SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus emerge directly as special cases of the same construction. Necessary and sufficient conditions for affine and scale invariance are derived under these generalized norms.

What carries the argument

Preconditioned matrix norms, which augment standard matrix norms with a preconditioning operator to encode both geometric adaptation and curvature utilization in a single object.
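For orientation, the standard steepest-descent step under a norm can be written out as below. This is textbook material, not the paper's own definition: the symbols G (gradient), η (step size), and the ellipsoid form of the preconditioned norm are assumptions chosen for illustration.

```latex
% Steepest descent under a generic norm \|\cdot\| with dual norm \|\cdot\|_*:
\Delta W^\star
  = \arg\min_{\Delta W}\; \langle G, \Delta W\rangle + \frac{1}{2\eta}\,\|\Delta W\|^2
  = -\eta\,\|G\|_*\, D^\star,
\qquad D^\star \in \arg\max_{\|D\|\le 1}\langle G, D\rangle .

% Frobenius norm (self-dual):
\Delta W^\star = -\eta\, G .

% Spectral norm, with reduced SVD G = U\Sigma V^\top (the dual norm is the nuclear norm):
\Delta W^\star = -\eta\,\|G\|_{\mathrm{nuc}}\; U V^\top .

% Ellipsoid norm \|\delta\|_P^2 = \delta^\top P\,\delta with P \succ 0,
% acting on vectorized parameters g = \mathrm{vec}(G):
\delta^\star = -\eta\, P^{-1} g .
```

In this simplified picture, P = I gives the plain gradient step, and a diagonal P built from the square root of a second-moment estimate gives an Adam-like element-wise scaling (without momentum).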

If this is right

  • Existing optimizers can be re-derived and compared inside one formalism instead of being developed in isolation.
  • Hybrid methods such as MuAdam arise systematically by selecting different combinations of norm and preconditioner.
  • Invariance properties for matrix-valued parameters can be checked or enforced by verifying the stated necessary and sufficient conditions.
  • New optimizers can be constructed by exploring other preconditioned norms that have not yet been instantiated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unification may allow automatic selection or interpolation between norms based on observed curvature or architecture type.
  • Similar preconditioned-norm constructions could be carried over to Riemannian or manifold-constrained optimization settings.
  • Convergence rates for the new MuAdam variants could be derived by specializing existing analyses of steepest descent under matrix norms.

Load-bearing premise

The abstraction of preconditioned matrix norms is assumed to capture the essential geometry and curvature of the listed optimizers, so that the unification is substantive rather than a consequence of how the norms are defined.

What would settle it

An explicit derivation showing that Adam or Muon cannot be recovered from any choice of preconditioned matrix norm without extra structure that lies outside the framework.

Figures

Figures reproduced from arXiv: 2510.10777 by Aleksandr Beznosikov, Aleksandr Bogdanov, Andrey Veprikov, Arman Bolatov, Martin Takáč, Samuel Horváth, Slavomir Hanzely.

Figure 1: Scale invariance experiment (Mushrooms, LIBSVM) with a two-layer MLP. Training loss (left, log-scale)
Figure 2: LLM fine-tuning results on Qwen2-7B: mean final accuracy with standard deviation across three seeds.
Original abstract

Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a unified optimization framework based on preconditioned matrix norms that generalizes steepest descent (via norm choice), quasi-Newton methods, and adaptive methods (via curvature incorporation). It claims that SGD (P = I), Adam (second-moment diagonal preconditioning), Muon (spectral norm), KL-Shampoo, SOAP, and SPlus all arise as special cases. The work derives necessary and sufficient conditions for affine and scale invariance under these generalized norms and proposes two new hybrids, MuAdam and MuAdam-SANIA, which combine Muon's spectral geometry with Adam-style preconditioning; experiments indicate these are competitive with or superior to existing methods on standard benchmarks. Code is provided for reproducibility.
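The three special cases named in the summary can be illustrated with a minimal NumPy sketch. It is not the paper's code: the function names are hypothetical, the Adam-style rule omits momentum and bias correction, and the spectral direction uses an exact SVD, whereas practical Muon implementations approximate it iteratively.

```python
import numpy as np


def sgd_direction(grad):
    """Frobenius geometry with identity preconditioner: the negative gradient."""
    return -grad


def adam_style_direction(grad, second_moment, eps=1e-8):
    """Element-wise scaling by the square root of a second-moment estimate.

    Simplification (assumption): no momentum and no bias correction.
    """
    return -grad / (np.sqrt(second_moment) + eps)


def spectral_direction(grad):
    """Steepest-descent direction under the spectral norm: -U V^T.

    Uses an exact reduced SVD for clarity.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -(u @ vt)


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))   # stand-in gradient of a 4x3 weight matrix
V = G ** 2                        # one-sample second-moment estimate

d_sgd = sgd_direction(G)
d_adam = adam_style_direction(G, V)
d_spec = spectral_direction(G)

# Every singular value of the spectral direction equals one.
spec_singular_values = np.linalg.svd(d_spec, compute_uv=False)
```

With the one-sample second moment used here, the Adam-style direction collapses to the sign of the gradient, which makes the contrast with the orthogonalized spectral direction easy to see.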

Significance. A non-tautological unification that independently recovers existing update rules from a shared geometric principle, together with explicit invariance conditions and new competitive hybrids, would constitute a useful organizing framework for optimizer design. The systematic invariance analysis and experimental validation of the proposed MuAdam variants are potentially valuable contributions if the derivations hold without post-hoc embedding of each method's preconditioner.

major comments (2)
  1. [§3] §3 (Definition of preconditioned matrix norms and recovery of existing methods): The central unification claim requires that minimizing ||ΔW||_P independently yields the exact update rules of SGD, Adam, Muon, etc. The manuscript should explicitly demonstrate that the choice of P for each optimizer is derived from geometric or curvature considerations rather than reverse-engineered to match the known update; otherwise the equivalence risks being definitional. A concrete example showing the norm minimization step for at least Adam and Muon, with the resulting closed-form update, would clarify this.
  2. [§4] §4 (Invariance conditions): The necessary and sufficient conditions for affine and scale invariance are presented under generalized norms. It is unclear whether these conditions are satisfied by the specific P choices that recover the listed optimizers (e.g., Adam's second-moment estimate or Muon's spectral projection), or whether additional restrictions are imposed that limit the framework's applicability. A table or proposition verifying invariance for each recovered method would strengthen the claim.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction should more clearly distinguish the novel contribution (preconditioned norms as a unifying principle) from the known fact that many optimizers can be viewed as preconditioned gradient steps.
  2. [Experiments] Experimental section: baseline comparisons should include recent hybrids such as SOAP and SPlus with identical hyperparameter tuning protocols to ensure the competitiveness claim for MuAdam variants is not due to tuning differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. These have helped us clarify the presentation of the preconditioned norms framework and its connections to existing methods. We address each major comment point by point below, with revisions made to strengthen the derivations and invariance analysis.

Point-by-point responses
  1. Referee: [§3] §3 (Definition of preconditioned matrix norms and recovery of existing methods): The central unification claim requires that minimizing ||ΔW||_P independently yields the exact update rules of SGD, Adam, Muon, etc. The manuscript should explicitly demonstrate that the choice of P for each optimizer is derived from geometric or curvature considerations rather than reverse-engineered to match the known update; otherwise the equivalence risks being definitional. A concrete example showing the norm minimization step for at least Adam and Muon, with the resulting closed-form update, would clarify this.

    Authors: We agree that explicit derivation of each P from geometric principles is essential to substantiate the unification. In the revised manuscript, we have added a dedicated subsection (3.3) that derives the preconditioner choices from first principles. For Muon, the geometry is the spectral norm, i.e. the operator norm induced by the Euclidean vector norms on the input and output spaces; steepest descent under this norm yields the orthogonalized update UV^T, where G = UΣV^T is the reduced SVD of the gradient (every nonzero singular value is set to one, then scaled by the step size), recovering the Muon rule without post-hoc fitting. For Adam, the diagonal preconditioner P is motivated as a curvature approximation via the second-moment estimate of the gradient, which corresponds to a diagonal Hessian approximation; the closed-form minimizer under ||ΔW||_P is then the element-wise scaled update, matching Adam exactly. These derivations are presented with the full minimization steps and resulting closed forms for both methods, showing that the P selections follow directly from the desired geometry or curvature model rather than being reverse-engineered. revision: yes

  2. Referee: [§4] §4 (Invariance conditions): The necessary and sufficient conditions for affine and scale invariance are presented under generalized norms. It is unclear whether these conditions are satisfied by the specific P choices that recover the listed optimizers (e.g., Adam's second-moment estimate or Muon's spectral projection), or whether additional restrictions are imposed that limit the framework's applicability. A table or proposition verifying invariance for each recovered method would strengthen the claim.

    Authors: We thank the referee for highlighting this verification gap. In the revision, we have inserted a new table (Table 1) in §4 that enumerates each recovered optimizer (SGD, Adam, Muon, KL-Shampoo, SOAP, SPlus, and the proposed MuAdam variants), specifies the corresponding P, and indicates satisfaction of the affine and scale invariance conditions from Propositions 4.1 and 4.2. We also add a short corollary proving that the listed P choices satisfy the necessary and sufficient conditions under the problem assumptions already stated in the paper (e.g., Adam satisfies scale invariance but not full affine invariance, while Muon's spectral norm satisfies both when the matrix dimensions permit). No additional restrictions beyond those in the original framework are required, confirming broad applicability. revision: yes
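The notion of affine invariance discussed in this exchange can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration, not the paper's test: it uses a quadratic loss L(w), a fixed invertible matrix A, and the re-parameterized loss Φ(u) = L(Au), then compares one step of gradient descent and one Newton step in both coordinate systems. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Quadratic loss L(w) = 0.5 * w^T H w - b^T w with a symmetric positive definite H.
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)
b = rng.standard_normal(d)

# Fixed invertible re-parameterization w = A u (upper triangular, nonzero diagonal).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5]])


def grad_L(w):
    return H @ w - b


def grad_Phi(u):
    # Chain rule for Phi(u) = L(A u).
    return A.T @ grad_L(A @ u)


w0 = rng.standard_normal(d)
u0 = np.linalg.solve(A, w0)          # same point expressed in the new coordinates
lr = 0.1

# One gradient-descent step in each coordinate system.
w_gd = w0 - lr * grad_L(w0)
u_gd = u0 - lr * grad_Phi(u0)

# One Newton step in each coordinate system (Hessian of Phi is A^T H A).
w_nt = w0 - np.linalg.solve(H, grad_L(w0))
u_nt = u0 - np.linalg.solve(A.T @ H @ A, grad_Phi(u0))

# An affine-invariant method satisfies A u_t == w_t at every step.
gd_gap = np.linalg.norm(A @ u_gd - w_gd)
newton_gap = np.linalg.norm(A @ u_nt - w_nt)
print(f"gradient descent gap: {gd_gap:.3e}")
print(f"Newton gap:           {newton_gap:.3e}")
```

The Newton gap vanishes up to floating-point error, while the gradient-descent gap does not, because A Aᵀ differs from the identity. The same kind of check could in principle be applied to any update rule by swapping in its step function.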

Circularity Check

1 step flagged

Unification holds by embedding existing preconditioners into the norm definition rather than deriving them independently

specific steps
  1. self-definitional [Abstract]
    "we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle."

    The preconditioned-norm abstraction is introduced precisely so that each listed optimizer corresponds to a particular choice of the preconditioner matrix P inside the norm; the update rule for that optimizer is then recovered by construction when the minimization is performed with that P. Equivalence therefore follows from the definitional setup rather than from an a-priori geometric principle that would have predicted the preconditioners without prior knowledge of the optimizers.

full rationale

The paper defines a general steepest-descent update via minimization of a preconditioned matrix norm ||ΔW||_P and then shows that SGD, Adam, Muon, etc. arise for particular choices of P (identity, second-moment diagonal, spectral norm, etc.). Because the specific P for each optimizer is selected precisely to reproduce that optimizer's known update rule, the claimed 'emergence as special cases' reduces to a definitional reparameterization rather than an independent geometric derivation. The new methods MuAdam and MuAdam-SANIA are genuine extensions, but the central unification claim for the listed existing methods is load-bearing on this construction. No self-citation chain or uniqueness theorem is invoked to force the result, so the circularity is partial (score 6) rather than total.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete; the central unification rests on the new norm definition and the claim that listed methods are special cases under it.

axioms (1)
  • domain assumption: Preconditioned matrix norms form a valid and sufficiently general abstraction for the listed optimization families.
    Invoked to establish that SGD, Adam, Muon, and others emerge as special cases.
invented entities (1)
  • preconditioned matrix norms (no independent evidence)
    purpose: To provide a single mathematical object that recovers steepest descent, quasi-Newton, and adaptive methods as instances.
    New concept introduced to unify the methods.

pith-pipeline@v0.9.0 · 5809 in / 1292 out tokens · 80819 ms · 2026-05-21T21:01:43.379336+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    math.OC 2026-05 conditional novelty 7.0

    Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

  2. LionMuon: Alternating Spectral and Sign Descent for Efficient Training

    cs.LG 2026-05 unverdicted novelty 6.0

    LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...
