arxiv: 2605.11850 · v1 · submitted 2026-05-12 · 🧮 math.OC · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

Konstantinos Oikonomidis , Jan Quan , Kimon Antonakopoulos , Antonio Silveti-Falls , Volkan Cevher , Panagiotis Patrinos

Authors on Pith no claims yet

Pith reviewed 2026-05-13 05:25 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords stochastic optimizationspectral preconditioningproximal gradient methodsnonconvex optimizationconstrained optimizationheavy-tailed noisevariance reduction

0 comments

The pith

Proximal spectral gradient methods converge for nonconvex constrained problems under heavy-tailed noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops proximal preconditioned stochastic gradient methods centered on spectral preconditioning as an extension of the Muon and Scion optimizers. These algorithms accommodate a broad class of convex and nonconvex constraints while providing convergence guarantees when stochastic gradients exhibit heavy-tailed noise. The analysis is built around the specific geometry induced by the preconditioner and proximal operator. A variance-reduced variant is presented that attains faster rates under conventional noise assumptions. The work also demonstrates that the polynomial iterations common in Muon are more accurately described by a nonlinear preconditioner than by the exact matrix sign function.

Core claim

We develop proximal preconditioned gradient methods with a focus on spectral gradient methods providing a proximal extension to the Muon and Scion optimizers. We introduce a family of stochastic algorithms that can handle a wide variety of convex and nonconvex constraints and study its convergence under heavy-tailed noise, through a novel analysis tailored to the geometry of the proposed methods. We further propose a variance-reduced version, which achieves faster convergence under standard noise assumptions. Finally, we show that the polynomial iterations used in Muon are more accurately captured by a nonlinear preconditioner than by the ideal matrix sign, leading to a convergence analysis.

What carries the argument

Proximal spectral preconditioning, which applies a spectral transformation to the gradient before taking a proximal step to enforce constraints while preserving the geometry needed for the convergence bounds.

If this is right

Stochastic optimization over nonconvex constraint sets now has provable convergence guarantees.
The methods remain stable when gradient estimates contain heavy tails, which occur in many practical sampling schemes.
Variance reduction yields strictly faster convergence rates while retaining the same proximal structure.
Practical polynomial implementations of spectral steps align with the nonlinear preconditioner model rather than the ideal sign function.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The geometric analysis may transfer to other preconditioned proximal methods that share similar update structures.
These algorithms could be tested on constrained deep-learning tasks such as training with hard architectural constraints.
Extensions to adaptive spectral preconditioners or different noise models remain open but follow the same geometric template.

Load-bearing premise

The geometry-tailored bounds hold only when the heavy-tailed noise and proximal mapping satisfy particular regularity conditions that close the analysis.

What would settle it

Running the algorithm on a simple nonconvex constrained problem with explicitly heavy-tailed stochastic gradients and observing failure to converge to a stationary point would disprove the claimed convergence result.

Figures

Figures reproduced from arXiv: 2605.11850 by Antonio Silveti-Falls, Jan Quan, Kimon Antonakopoulos, Konstantinos Oikonomidis, Panagiotis Patrinos, Volkan Cevher.

**Figure 2.** Figure 2: (a-b) Illustration of the nonlinear preconditioning induced by [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Demonstration of proximal spectral preconditioning with several different [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗

**Figure 4.** Figure 4: The Polar Express fits more closely to ( [PITH_FULL_IMAGE:figures/full_fig_p035_4.png] view at source ↗

**Figure 5.** Figure 5: Median number of epochs to 95% validation accuracy for training a transformer on [PITH_FULL_IMAGE:figures/full_fig_p039_5.png] view at source ↗

**Figure 6.** Figure 6: Test accuracy for CNN trained on the CIFAR10 dataset. The radii of the lmos [PITH_FULL_IMAGE:figures/full_fig_p039_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison unconstrained Scion with proximal spectral preconditioning using a [PITH_FULL_IMAGE:figures/full_fig_p040_7.png] view at source ↗

read the original abstract

In this work, we develop proximal preconditioned gradient methods with a focus on spectral gradient methods providing a proximal extension to the Muon and Scion optimizers. We introduce a family of stochastic algorithms that can handle a wide variety of convex and nonconvex constraints and study its convergence under heavy-tailed noise, through a novel analysis tailored to the geometry of the proposed methods. We further propose a variance-reduced version, which achieves faster convergence under standard noise assumptions. Finally, we show that the polynomial iterations used in Muon are more accurately captured by a nonlinear preconditioner than by the ideal matrix sign, leading to a convergence analysis that more faithfully reflects practical implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper extends Muon-style spectral preconditioning to proximal methods for constrained nonconvex stochastic optimization under heavy-tailed noise, but the geometry-tailored analysis may require unstated restrictions on nonconvex sets.

read the letter

The main thing to know is that this work extends spectral preconditioning to proximal methods for nonconvex constrained stochastic optimization with heavy-tailed noise via a tailored analysis, but the nonconvex claims rest on geometry properties that may not hold generally. What is new is the proximal extension of Muon/Scion and the heavy-tailed convergence analysis adapted to the preconditioner geometry, along with a variance-reduced version and a better model for Muon iterations as nonlinear preconditioners. The paper does well in addressing practical constraints in optimization and providing rates under heavy tails, which is relevant for ML. Soft spots include potential hidden assumptions in the analysis for nonconvex sets, as the proximal mapping may not be non-expansive, which could affect the descent bounds under heavy-tailed noise. The abstract claims a novel tailored analysis, but without seeing explicit assumption lists or derivation steps, it's hard to confirm if it supports the broad claims. This paper is for researchers in stochastic optimization interested in preconditioned methods for constrained problems. A reader working on heavy-tailed noise or spectral methods would get value from the extensions. It deserves a serious referee to examine the proofs and assumptions in detail.

Referee Report

1 major / 2 minor

Summary. The manuscript develops proximal extensions of spectral preconditioned gradient methods (as proximal versions of Muon and Scion) for stochastic optimization. It introduces a family of algorithms that accommodate convex and nonconvex constraints, establishes convergence under heavy-tailed noise via a novel geometry-tailored analysis, proposes a variance-reduced variant achieving faster rates under standard noise, and argues that nonlinear preconditioners more faithfully model the polynomial iterations used in Muon than the matrix sign function.

Significance. If the stated convergence results hold with the claimed generality, the work would be a useful contribution to constrained nonconvex stochastic optimization, particularly for settings with heavy-tailed noise common in machine learning. The geometry-aware analysis and the practical observation on Muon modeling are potentially valuable; the variance-reduced extension is a standard but welcome addition. The overall significance is moderate and depends on whether the novel analysis closes without hidden restrictions on the constraint sets.

major comments (1)

[§4] §4 (Convergence Analysis under Heavy-Tailed Noise), the descent inequality leading to Theorem 4.1: the argument absorbs heavy-tail moments by treating the proximal step as approximately non-expansive (or using a local curvature control derived from the preconditioner geometry). This property does not hold for arbitrary nonconvex constraint sets, where the proximal mapping can expand distances locally (e.g., indicator functions of nonconvex varieties). The claim of applicability to a 'wide variety of ... nonconvex constraints' is therefore load-bearing and requires either an explicit additional assumption (local convexity or bounded prox-Lipschitz constant) or a revised bound that does not rely on non-expansiveness.

minor comments (2)

[Abstract] Abstract: the phrase 'novel analysis tailored to the geometry' is repeated without a one-sentence pointer to the key technical device (e.g., a specific inequality or Lyapunov function) that distinguishes it from standard prox-gradient analyses.
[Notation] Notation section: the definition of the spectral preconditioner (likely Eq. (3) or (5)) should explicitly state whether it is applied before or after the proximal mapping, as this affects the interpretation of the 'nonlinear preconditioner' claim for Muon.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The single major comment raises an important point about the scope of the convergence analysis for nonconvex constraints, which we address directly below. We will revise the manuscript accordingly.

read point-by-point responses

Referee: [§4] §4 (Convergence Analysis under Heavy-Tailed Noise), the descent inequality leading to Theorem 4.1: the argument absorbs heavy-tail moments by treating the proximal step as approximately non-expansive (or using a local curvature control derived from the preconditioner geometry). This property does not hold for arbitrary nonconvex constraint sets, where the proximal mapping can expand distances locally (e.g., indicator functions of nonconvex varieties). The claim of applicability to a 'wide variety of ... nonconvex constraints' is therefore load-bearing and requires either an explicit additional assumption (local convexity or bounded prox-Lipschitz constant) or a revised bound that does not rely on non-expansiveness.

Authors: We agree that the descent inequality in the proof of Theorem 4.1 relies on controlling the expansion of the proximal mapping, which the manuscript justifies through local curvature properties induced by the spectral preconditioner. However, this control is not automatic for completely arbitrary nonconvex sets, as the referee correctly notes. To resolve the issue without weakening the heavy-tailed noise result, we will introduce an explicit additional assumption (new Assumption 4.3 in the revised version) requiring that the proximal operator satisfies a bounded prox-Lipschitz constant relative to the preconditioner geometry. This assumption is satisfied by the constraint sets used in the numerical experiments and by many practical nonconvex constraints (e.g., those with bounded curvature or when the preconditioner dominates local expansion). We will also add a short discussion clarifying the class of constraints for which the assumption holds, update the statement of Theorem 4.1 to reference the new assumption, and revise the proof to make the dependence explicit. No change is needed to the variance-reduced analysis or the Muon modeling section. revision: yes

Circularity Check

0 steps flagged

No circularity: novel analysis and preconditioner claims remain independent of fitted inputs or self-referential definitions

full rationale

The provided abstract and context describe a family of proximal preconditioned gradient methods with a novel geometry-tailored convergence analysis under heavy-tailed noise, plus a variance-reduced variant and a comparison of polynomial iterations to nonlinear preconditioners. No equations, parameter fits, or derivation steps are exhibited that reduce a claimed prediction or uniqueness result to a definition or self-citation by construction. The 'tailored to the geometry' phrasing does not, on the given text, equate the analysis bounds to quantities defined solely via the preconditioner itself; the central claims retain independent content from the method proposal and external noise assumptions. This is the expected honest non-finding for an abstract-level description lacking load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted. The 'novel analysis' likely rests on unstated domain assumptions about noise moments and proximal geometry.

pith-pipeline@v0.9.0 · 5426 in / 1140 out tokens · 68214 ms · 2026-05-13T05:25:07.677464+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

proximal preconditioned gradient methods... nonlinear preconditioner ∇ϕ*... anisotropic proximal mapping... gap Dϕ*(∇f(xk),−e∇g(xk))
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

reference function ϕ... strongly convex... even... ϕ(0)=0... domϕ bounded

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 2 internal anchors

[1]

Dion: Distributed orthonormalized updates.arXiv preprint arXiv:2504.05295, 2025

K. Ahn, B. Xu, N. Abreu, Y. Fan, G. Magakyan, P. Sharma, Z. Zhan, and J. Langford. “Dion: Distributed orthonormalized updates”. In:arXiv preprint arXiv:2504.05295(2025)

work page arXiv 2025
[2]

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

N. Amsel, D. Persson, C. Musco, and R. M. Gower. “The Polar Express: Optimal matrix sign methods and their application to the Muon algorithm”. In:arXiv preprint arXiv:2505.16932(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications

H. H. Bauschke, J. Bolte, and M. Teboulle. “A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications”. In:Mathematics of Operations Research42.2 (2017), pp. 330–348

work page 2017
[4]

H. H. Bauschke and P. L. Combettes.Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2017

work page 2017
[5]

Beck.First-order methods in optimization

A. Beck.First-order methods in optimization. SIAM, 2017

work page 2017
[6]

signSGD: Com- pressed optimisation for non-convex problems

J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. “signSGD: Com- pressed optimisation for non-convex problems”. In:International Conference on Machine Learning. 2018, pp. 560–569

work page 2018
[7]

Bhatia.Matrix Analysis

R. Bhatia.Matrix Analysis. Springer Science & Business Media, 1997

work page 1997
[8]

Escaping saddle points without Lipschitz smoothness: the power of nonlinear preconditioning

A. Bodard and P. Patrinos. “Escaping saddle points without Lipschitz smoothness: the power of nonlinear preconditioning”. In:Advances in Neural Information Processing Systems. 2025, pp. 124173–124208

work page 2025
[9]

Preconditioned Spectral Descent for Deep Learning

D. Carlson, E. Collins, Y. -P. Hsieh, L. Carin, and V. Cevher. “Preconditioned Spectral Descent for Deep Learning”. In:Advances in Neural Information Processing Systems. 2015

work page 2015
[10]

F. L. Cesista, Y. Jiacheng, and K. Jordan.Squeezing 1-2% Efficiency Gains Out of Muon by Optimizing the Newton-Schulz coefficients. 2025.url: https://leloykun.github. io/ponder/muon-opt-coeffs/

work page 2025
[11]

Muon Optimizes Under Spectral Norm Constraints

L. Chen, J. Li, and Q. Liu. “Muon Optimizes Under Spectral Norm Constraints”. In: Transactions on Machine Learning Research(2026).issn: 2835-8856

work page 2026
[12]

Lion secretly solves constrained optimization: As Lyapunov predicts

L. Chen, B. Liu, K. Liang, and Q. Liu. “Lion secretly solves constrained optimization: As Lyapunov predicts”. In:arXiv preprint arXiv:2310.05898(2023)

work page arXiv 2023
[13]

Symbolic Discovery of Optimization Algorithms

X. Chen et al. “Symbolic Discovery of Optimization Algorithms”. In:Advances in Neural Information Processing Systems. 2023, pp. 49205–49233

work page 2023
[14]

Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization

Z. Chen, Y. Zhou, Y. Liang, and Z. Lu. “Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization”. In:International Conference on Machine Learning. 2023, pp. 5396–5427

work page 2023
[15]

Moreau’s decomposition in Banach spaces

P. L. Combettes and N. N. Reyes. “Moreau’s decomposition in Banach spaces”. In: Mathematical Programming139.1 (2013), pp. 103–114

work page 2013
[16]

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

A. Cutkosky and H. Mehta. “High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails”. In:Advances in Neural Information Processing Systems. 2021, pp. 4883–4895

work page 2021
[17]

Momentum improves normalized SGD

A. Cutkosky and H. Mehta. “Momentum improves normalized SGD”. In:International Conference on Machine Learning. 2020, pp. 2260–2268. 14

work page 2020
[18]

Momentum-Based Variance Reduction in Non-Convex SGD

A. Cutkosky and F. Orabona. “Momentum-Based Variance Reduction in Non-Convex SGD”. In:Advances in Neural Information Processing Systems. 2019

work page 2019
[19]

On Φ-Convexity in Extremal Problems

S. Dolecki and S. Kurcyusz. “On Φ-Convexity in Extremal Problems”. In:SIAM Journal on Control and Optimization16.2 (1978), pp. 277–300

work page 1978
[20]

Can SGD Handle Heavy-Tailed Noise?

I. Fatkhullin, F. H¨ ubler, and G. Lan. “Can SGD Handle Heavy-Tailed Noise?” In:arXiv preprint arXiv:2508.04860(2025)

work page arXiv 2025
[21]

High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise

E. Gorbunov, A. Sadiev, M. Danilova, S. Horv´ ath, G. Gidel, P. Dvurechensky, A. Gas- nikov, and P. Richt´ arik. “High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise”. In: International Conference on Machine Learning. 2024, pp. 15951–16070

work page 2024
[22]

Shampoo: Preconditioned stochastic tensor opti- mization

V. Gupta, T. Koren, and Y. Singer. “Shampoo: Preconditioned stochastic tensor opti- mization”. In:International Conference on Machine Learning. 2018, pp. 1842–1850

work page 2018
[23]

R. A. Horn and C. R. Johnson.Topics in matrix analysis. Cambridge university press, 1994

work page 1994
[24]

From Gradient Clipping to Normalization for Heavy Tailed SGD

F. H¨ ubler, I. Fatkhullin, and N. He. “From Gradient Clipping to Normalization for Heavy Tailed SGD”. In:The 28th International Conference on Artificial Intelligence and Statistics. 2025, pp. 2413–2421

work page 2025
[25]

Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise

D. Jakoveti´ c, D. Bajovi´ c, A. K. Sahu, S. Kar, N. Miloˇ sevi´ c, and D. Stamenkovi´ c. “Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise”. In:SIAM Journal on Optimization33.2 (2023), pp. 394– 423

work page 2023
[26]

Jordan, Y

K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein.Muon: An optimizer for hidden layers in neural networks. 2024.url: https://kellerjordan. github.io/posts/muon/

work page 2024
[27]

Analyzing and improving the training dynamics of diffusion models

T. Karras, M. Aittala, J. Lehtinen, J. Hellsten, T. Aila, and S. Laine. “Analyzing and improving the training dynamics of diffusion models”. In:Conference on Computer Vision and Pattern Recognition. 2024, pp. 24174–24184

work page 2024
[28]

Revisiting gradient clipping: Stochastic bias and tight convergence guarantees

A. Koloskova, H. Hendrikx, and S. U. Stich. “Revisiting gradient clipping: Stochastic bias and tight convergence guarantees”. In:International Conference on Machine Learning. 2023, pp. 17343–17363

work page 2023
[29]

Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under ( L0, L1)-Smoothness

N. Kornilov, P. Zmushko, A. Semenov, M. Ikonnikov, A. Gasnikov, and A. Beznosikov. “Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under ( L0, L1)-Smoothness”. In:arXiv preprint arXiv:2502.07923 (2025)

work page arXiv 2025
[30]

Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization.arXiv preprint arXiv:2503.12645,

D. Kovalev. “Understanding gradient orthogonalization for deep learning via non- Euclidean trust-region optimization”. In:arXiv preprint arXiv:2503.12645(2025)

work page arXiv 2025
[31]

Krizhevsky et al.Learning multiple layers of features from tiny images

A. Krizhevsky et al.Learning multiple layers of features from tiny images. 2009

work page 2009
[32]

Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective.arXiv preprint arXiv:2505.21799, 2025

T. T.-K. Lau, Q. Long, and W. Su. “PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective”. In:arXiv preprint arXiv:2505.21799 (2025)

work page arXiv 2025
[33]

Lower envelopes and lifting for structured nonconvex optimization

E. Laude. “Lower envelopes and lifting for structured nonconvex optimization”. PhD thesis. Technical University of Munich, 2021. 15

work page 2021
[34]

Anisotropic proximal gradient

E. Laude and P. Patrinos. “Anisotropic proximal gradient”. In:Mathematical Program- ming214 (2025), pp. 801–845

work page 2025
[35]

Dualities for non-Euclidean smoothness and strong convexity under the light of generalized conjugacy

E. Laude, A. Themelis, and P. Patrinos. “Dualities for non-Euclidean smoothness and strong convexity under the light of generalized conjugacy”. In:SIAM Journal on Optimization33.4 (2023), pp. 2721–2749

work page 2023
[36]

Optimization of inf-convolution regularized nonconvex composite problems

E. Laude, T. Wu, and D. Cremers. “Optimization of inf-convolution regularized nonconvex composite problems”. In:The 22nd International Conference on Artificial Intelligence and Statistics. 2019, pp. 547–556

work page 2019
[37]

The convex analysis of unitarily invariant matrix functions

A. S. Lewis. “The convex analysis of unitarily invariant matrix functions”. In:Journal of Convex Analysis2.1 (1995), pp. 173–183

work page 1995
[38]

Convex and Non-convex Optimization Under Generalized Smoothness

H. Li, J. Qian, Y. Tian, A. Rakhlin, and A. Jadbabaie. “Convex and Non-convex Optimization Under Generalized Smoothness”. In:Advances in Neural Information Processing Systems. 2023, pp. 40238–40271

work page 2023
[39]

Preconditioned stochastic gradient descent

X.-L. Li. “Preconditioned stochastic gradient descent”. In:IEEE transactions on neural networks and learning systems29.5 (2017), pp. 1454–1466

work page 2017
[40]

Communication Efficient Distributed Training with Distributed Lion

B. Liu, L. Wu, L. Chen, K. Liang, J. Zhu, C. Liang, R. Krishnamoorthi, and Q. Liu. “Communication Efficient Distributed Training with Distributed Lion”. In:Advances in Neural Information Processing Systems. 2024, pp. 18388–18415

work page 2024
[41]

Mars-m: When variance reduction meets matrices

Y. Liu, A. Yuan, and Q. Gu. “MARS-M: When variance reduction meets matrices”. In: arXiv preprint arXiv:2510.21800(2025)

work page arXiv 2025
[42]

Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping

Z. Liu and Z. Zhou. “Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping”. In:arXiv preprint arXiv:2412.19529 (2024)

work page arXiv 2024
[43]

Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzęb- ski

I. Loshchilov, C.-P. Hsieh, S. Sun, and B. Ginsburg. “nGPT: Normalized transformer with representation learning on the hypersphere”. In:arXiv preprint arXiv:2410.01131 (2024)

work page arXiv 2024
[44]

Dual space preconditioning for gradient descent

C. J. Maddison, D. Paulin, Y. W. Teh, and A. Doucet. “Dual space preconditioning for gradient descent”. In:SIAM Journal on Optimization31.1 (2021), pp. 991–1016

work page 2021
[45]

Spectral Normalization for Generative Adversarial Networks

T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. “Spectral normalization for generative adversarial networks”. In:arXiv preprint arXiv:1802.05957(2018)

work page Pith review arXiv 2018
[46]

B. S. Mordukhovich.Variational Analysis and Generalized Differentiation I: Basic Theory. Vol. 330. Springer, 2006

work page 2006
[47]

Nesterov.Lectures on Convex Optimization

Y. Nesterov.Lectures on Convex Optimization. Springer, 2018

work page 2018
[48]

P., Cesista, F., Zahorodnii, A., Bernstein, J., and Isola, P

L. Newhouse, R. P. Hess, F. Cesista, A. Zahorodnii, J. Bernstein, and P. Isola. “Training transformers with enforced Lipschitz constants”. In:arXiv preprint arXiv:2507.13338 (2025)

work page arXiv 2025
[49]

Nonlinearly Preconditioned Gradient Methods under Generalized Smoothness

K. Oikonomidis, J. Quan, E. Laude, and P. Patrinos. “Nonlinearly Preconditioned Gradient Methods under Generalized Smoothness”. In:International Conference on Machine Learning. 2025, pp. 47132–47154

work page 2025
[50]

Nonlinearly Preconditioned Gradient Meth- ods: Momentum and Stochastic Analysis

K. Oikonomidis, J. Quan, and P. Patrinos. “Nonlinearly Preconditioned Gradient Meth- ods: Momentum and Stochastic Analysis”. In:Advances in Neural Information Processing Systems. 2025, pp. 38957–38988. 16

work page 2025
[51]

Training Deep Learning Models with Norm-Constrained LMOs

T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. “Training Deep Learning Models with Norm-Constrained LMOs”. In:International Conference on Machine Learning. 2025, pp. 49069–49104

work page 2025
[52]

Generalized Gradient Norm Clipping & Non-Euclidean ( L0, L1)-Smoothness

T. Pethick, W. Xie, M. Erdogan, K. Antonakopoulos, A. Silveti-Falls, and V. Cevher. “Generalized Gradient Norm Clipping & Non-Euclidean ( L0, L1)-Smoothness”. In:Ad- vances in Neural Information Processing Systems. 2025, pp. 21170–21208

work page 2025
[53]

Multidimensional probability inequalities via spherical symmetry

I. Pinelis. “Multidimensional probability inequalities via spherical symmetry”. In:arXiv preprint arXiv:2210.04391(2022)

work page arXiv 2022
[54]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. “Grokking: Generalization beyond overfitting on small algorithmic datasets”. In:arXiv preprint arXiv:2201.02177 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[55]

Muon is provably faster with momentum variance reduction.arXiv preprint arXiv:2512.16598, 2025

X. Qian, H. Rammal, D. Kovalev, and P. Richtarik. “Muon is Provably Faster with Momentum Variance Reduction”. In:arXiv preprint arXiv:2512.16598(2025)

work page arXiv 2025
[56]

Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of lmo-based Optimizers for LLMs)

A. Riabinin, E. Shulgin, K. Gruntkowska, and P. Richt´ arik. “Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of lmo-based Optimizers for LLMs)”. In:arXiv preprint arXiv:2505.13416(2025)

work page arXiv 2025
[57]

R. T. Rockafellar.Convex Analysis. Princeton University Press, 1970

work page 1970
[58]

R. T. Rockafellar and R. J. Wets.Variational Analysis. New York: Springer, 1998

work page 1998
[59]

Lions and muons: Optimization via stochastic frank- wolfe.arXiv preprint arXiv:2506.04192, 2025

M.-E. Sfyraki and J. -K. Wang. “Lions and Muons: Optimization via stochastic Frank– Wolfe”. In:arXiv preprint arXiv:2506.04192(2025)

work page arXiv 2025
[60]

Shalev-Shwartz and S

S. Shalev-Shwartz and S. Ben-David.Understanding machine learning: From theory to algorithms. Cambridge university press, 2014

work page 2014
[61]

Beyond the ideal: Analyzing the inexact muon update.arXiv preprint arXiv:2510.19933, 2025

E. Shulgin, S. AlRashed, F. Orabona, and P. Richt´ arik. “Beyond the ideal: Analyzing the inexact Muon update”. In:arXiv preprint arXiv:2510.19933(2025)

work page arXiv 2025
[62]

Entropic proximal mappings with applications to nonlinear programming

M. Teboulle. “Entropic proximal mappings with applications to nonlinear programming”. In:Mathematics of Operations Research17.3 (1992), pp. 670–690

work page 1992
[63]

Toward a Unified Theory of Gradient Descent under Generalized Smooth- ness

A. Tyurin. “Toward a Unified Theory of Gradient Descent under Generalized Smooth- ness”. In:International Conference on Machine Learning. 2025, pp. 60493–60514

work page 2025
[64]

Optimizing ( L0, L1)- Smooth Functions by Gradient Methods

D. Vankov, A. Rodomanov, A. Nedich, L. Sankar, and S. U. Stich. “Optimizing ( L0, L1)- Smooth Functions by Gradient Methods”. In:arXiv preprint arXiv:2410.10800(2024)

work page arXiv 2024
[65]

Villani.Optimal Transport: Old and New

C. Villani.Optimal Transport: Old and New. Springer, 2009

work page 2009
[66]

Convergence of AdaGrad for non-convex objectives: Simple proofs and relaxed assumptions

B. Wang, H. Zhang, Z. Ma, and W. Chen. “Convergence of AdaGrad for non-convex objectives: Simple proofs and relaxed assumptions”. In:Conference on Learning Theory. 2023, pp. 161–190

work page 2023
[67]

K. Wen, X. Dang, K. Lyu, T. Ma, and P. Liang.Fantastic Pretraining Optimizers and Where to Find Them 2.1: Hyperball Optimization. 2025.url: https://tinyurl.com/ muonh

work page 2025
[68]

Controlled llm training on spectral sphere.arXiv preprint arXiv:2601.08393, 2026

T. Xie et al. “Controlled LLM Training on Spectral Sphere”. In:arXiv preprint arXiv:2601.08393(2026)

work page arXiv 2026
[69]

MARS: Unleashing the Power of Variance Reduction for Training Large Models

H. Yuan, Y. Liu, S. Wu, Z. Xun, and Q. Gu. “MARS: Unleashing the Power of Variance Reduction for Training Large Models”. In:International Conference on Machine Learning. 2025, pp. 73553–73587. 17

work page 2025
[70]

Improved Analysis of Clipping Algorithms for Non-convex Optimization

B. Zhang, J. Jin, C. Fang, and L. Wang. “Improved Analysis of Clipping Algorithms for Non-convex Optimization”. In:Advances in Neural Information Processing Systems. 2020, pp. 15511–15521

work page 2020
[71]

Why gradient clipping accelerates training: A theoretical justification for adaptivity

J. Zhang, T. He, S. Sra, and A. Jadbabaie. “Why gradient clipping accelerates training: A theoretical justification for adaptivity”. In:arXiv preprint arXiv:1905.11881(2019)

work page arXiv 1905
[72]

AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates

M. Zhang, Y. Liu, and H. Schaeffer. “AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates”. In:arXiv preprint arXiv:2509.02981(2025). 18 A Preliminaries and helper results Lipschitz smoothness of a function f : E→R with constant L > 0 means that f is continuously differentiable and satisfies ∥∇f(x)− ∇f(¯x)∥ ≤L∥x−¯x∥ ∀x,¯x∈E. Lipschitz smoothness ...

work page arXiv 2025
[73]

kX i=1 (1−α) k−i(∇f(x i, ξi)− ∇f(x i)) # ≤E

◦σ , [37, Corollary 2.6] ensures that Φ − µ 2 ∥ · ∥2 F is convex, and using the reverse direction of [5, Theorem 5.17] yields the desired result. That Φ is lsc also follows from [37, Corollary 2.6]. Second, since Φ = ϕ◦σ is real orthogonal invariant from [37, Proposition 2.2], we have that Φ( −X) = Φ(( −Im)XI n) = Φ( X) for all X∈R m×n, thus Φ is even. Fu...

work page 2000