Recognition: no theorem link
Muon Does Not Converge on Convex Lipschitz Functions
Pith reviewed 2026-05-12 02:13 UTC · model grok-4.3
The pith
Muon does not converge on convex Lipschitz functions regardless of learning rate schedule
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon and its variants do not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. Error feedback can restore convergence for Muon and all non-Euclidean subgradient methods with momentum, yet this fix degrades performance in practice on image classification and language modeling tasks.
What carries the argument
The construction of a convex Lipschitz function on which the Muon iterates fail to converge to the optimum for any choice of step sizes, together with the error-feedback correction that restores convergence.
Load-bearing premise
The assumption that the convex Lipschitz class without smoothness is a relevant test case for whether Muon can be understood theoretically.
What would settle it
Exhibiting a learning rate schedule under which Muon converges to the optimum on the paper's constructed counterexample, or more broadly a proof that every convex Lipschitz function admits such a schedule.
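As a purely illustrative probe of this criterion (not the paper's construction), one could run a Muon-style update on a simple convex Lipschitz objective and track the suboptimality gap. In the sketch below, the objective f(X) = ||X||_F, the 1/sqrt(t) schedule, and the exact-SVD orthogonalization are all assumptions made for the sketch; practical Muon approximates the orthogonalization with Newton-Schulz iterations, and the function and hyperparameters here are ours.

```python
import numpy as np

def orthogonalize(M):
    """Semi-orthogonal factor U V^T of M (exact SVD here; Muon approximates
    this with a few Newton-Schulz iterations in practice)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_probe(steps=2000, beta=0.95, eta0=0.1, seed=0):
    """Run a Muon-style update on f(X) = ||X||_F (convex, 1-Lipschitz,
    minimized at X = 0) with the schedule eta_t = eta0 / sqrt(t)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((8, 8))            # iterate
    M = np.zeros_like(X)                       # momentum buffer
    gaps = []
    for t in range(1, steps + 1):
        G = X / max(np.linalg.norm(X), 1e-12)  # subgradient of ||X||_F
        M = beta * M + G                       # heavy-ball momentum
        X = X - (eta0 / np.sqrt(t)) * orthogonalize(M)
        gaps.append(np.linalg.norm(X))         # f(X_t) - f(X*), since f(X*) = 0
    return gaps

if __name__ == "__main__":
    gaps = muon_probe()
    print(f"best gap: {min(gaps):.4f}, final gap: {gaps[-1]:.4f}")
```

A probe of this kind on one hand-picked function can at most illustrate the mechanics; it neither confirms nor refutes the theorem, which concerns the paper's specific adversarial construction.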
Original abstract
Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Muon fails to converge on convex Lipschitz functions for any learning-rate schedule, via a direct counterexample construction. It further shows that error feedback restores convergence for Muon and other non-Euclidean subgradient methods with momentum, but that this modification degrades empirical performance on CIFAR-10 image classification and nanoGPT language modeling on FineWeb-Edu. The conclusion is that the convex Lipschitz model is unsuitable for explaining Muon's success, which must rely on smoothness or other structure absent from the model.
Significance. If the non-convergence result holds, the work is significant because it provides a concrete counterexample separating Muon from the convex Lipschitz setting that has been central to analyses of AdaGrad, Shampoo, and schedule-free methods. The direct (non-circular) nature of the counterexample and the reproducible empirical degradation with error feedback are strengths. The result usefully redirects theoretical attention toward smoothness assumptions that may better capture where Muon succeeds in practice.
Major comments (2)
- [main theorem / counterexample construction] The central claim (abstract and main theorem) asserts non-convergence 'regardless of the choice of learning rate schedule.' The counterexample must be shown to defeat every adaptive rule in which the step size at iteration t is an arbitrary positive function of the observed gradient sequence and prior iterates. Because the constructed function is convex and Lipschitz, gradient norms are bounded; it is not immediate that no history-dependent rule can select safe steps that avoid the exhibited oscillation. The proof should explicitly rule out such rules or state the precise class of schedules covered (one possible formalization of that class follows this list).
- [error feedback section] The error-feedback convergence restoration (section on error feedback) is stated for 'all the non-Euclidean subgradient methods with momentum.' The argument should clarify whether the same step-size restrictions or additional assumptions (e.g., on the momentum parameter or the geometry) are required for each method, and whether the restored convergence rate is comparable to standard subgradient methods.
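One way to make the requested scope precise, in the reviewer's notation rather than the paper's: the schedule class and the non-convergence conclusion it should withstand could be written as follows.

```latex
% A possible formalization of "any learning-rate schedule" (our notation):
% eta_t may depend on the whole observed history, yet the suboptimality
% gap of the Muon iterates stays bounded away from zero on the
% constructed convex Lipschitz function f with minimizer x^*.
\[
  \eta_t \;=\; h_t\!\bigl(g_1,\dots,g_t,\;x_0,\dots,x_t\bigr) \;>\; 0
  \quad \text{for all } t,
  \qquad\text{yet}\qquad
  \liminf_{t\to\infty}\,\bigl(f(x_t) - f(x^\star)\bigr) \;>\; 0 .
\]
```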
Minor comments (2)
- [preliminaries] Notation for the Muon update (momentum and normalization steps) should be stated once in a single display equation early in the paper to avoid repeated inline definitions (a candidate display follows this list).
- [experiments] The experimental section would benefit from reporting the precise error-feedback implementation (e.g., the scaling factor or accumulation rule) used in the CIFAR-10 and nanoGPT runs so that the degradation can be reproduced exactly.
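Along the lines of the first minor comment, one common way to write the Muon step in a single display is given below. The notation is the reviewer's, not the paper's; practical implementations replace the exact polar factor with a few Newton-Schulz iterations, and the paper's precise variant may differ.

```latex
% Muon-style update with momentum and orthogonalization (our notation):
% G_t is the (sub)gradient, M_t the momentum buffer, eta_t the schedule,
% and msign(M) = U V^T is the semi-orthogonal polar factor of M.
\[
  M_t = \beta\, M_{t-1} + G_t, \qquad
  X_{t+1} = X_t - \eta_t \,\mathrm{msign}(M_t), \qquad
  \mathrm{msign}(M) = U V^\top \ \text{where } M = U \Sigma V^\top .
\]
```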
Simulated Author's Rebuttal
Thank you for the detailed and insightful referee report. We appreciate the opportunity to clarify the scope of our results and will revise the manuscript to address the points raised.
Point-by-point responses
Referee: [main theorem / counterexample construction] The central claim (abstract and main theorem) asserts non-convergence 'regardless of the choice of learning rate schedule.' The counterexample must be shown to defeat every adaptive rule in which the step size at iteration t is an arbitrary positive function of the observed gradient sequence and prior iterates. Because the constructed function is convex and Lipschitz, gradient norms are bounded; it is not immediate that no history-dependent rule can select safe steps that avoid the exhibited oscillation. The proof should explicitly rule out such rules or state the precise class of schedules covered.
Authors: Our main theorem establishes that on the constructed convex Lipschitz function, the Muon iterates fail to converge for any sequence of positive learning rates. Any adaptive schedule that selects the step size as a positive function of the gradient history and prior iterates will generate some such sequence of positive steps. Consequently, the non-convergence holds for all such adaptive rules. We will revise the statement of the theorem and the surrounding discussion to explicitly indicate that the result applies to arbitrary positive step-size sequences, thereby encompassing all history-dependent adaptive schedules. The bounded gradient norms are leveraged in the construction to ensure the oscillation is independent of the particular step sizes. revision: yes
Referee: [error feedback section] The error-feedback convergence restoration (section on error feedback) is stated for 'all the non-Euclidean subgradient methods with momentum.' The argument should clarify whether the same step-size restrictions or additional assumptions (e.g., on the momentum parameter or the geometry) are required for each method, and whether the restored convergence rate is comparable to standard subgradient methods.
Authors: The convergence result with error feedback applies to all considered non-Euclidean subgradient methods with momentum under the standard step-size conditions for subgradient methods, namely that the step sizes satisfy sum eta_t = infinity and sum eta_t^2 < infinity. The momentum parameter is taken in the open interval (0,1), consistent with typical usage, and the underlying geometry is that of the respective method. The restored convergence rate is the same as that of standard subgradient methods with error feedback, which is O(1/sqrt(T)) in the worst case. We will include an additional remark in the error feedback section to make these conditions and the rate explicit. revision: yes
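To make the kind of correction under discussion concrete, a minimal error-feedback sketch around a Muon-style direction is given below. It follows one common error-feedback template (buffer the part of the momentum that orthogonalization discards and re-inject it at the next step); the paper's exact accumulation and scaling rule may differ, which is precisely what the second minor comment asks the authors to report. All names and constants here are illustrative assumptions.

```python
import numpy as np

def orthogonalize(M):
    """Semi-orthogonal factor U V^T of M (exact SVD; Newton-Schulz in practice)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def ef_muon_step(X, M, E, G, eta, beta=0.95):
    """One iteration: momentum, add the error buffer, 'compress' the direction
    by orthogonalization, step with eta, and carry the residual forward."""
    M = beta * M + G            # momentum on raw (sub)gradients
    P = E + M                   # direction to compress, incl. past residual
    D = orthogonalize(P)        # direction actually applied
    E = P - D                   # residual fed back at the next iteration
    X = X - eta * D             # step with the schedule eta_t
    return X, M, E

# Usage on the toy objective f(X) = ||X||_F with eta_t = 0.1 / sqrt(t):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 8))
    M, E = np.zeros_like(X), np.zeros_like(X)
    for t in range(1, 2001):
        G = X / max(np.linalg.norm(X), 1e-12)   # subgradient of ||X||_F
        X, M, E = ef_muon_step(X, M, E, G, eta=0.1 / np.sqrt(t))
    print(f"final gap: {np.linalg.norm(X):.4f}")
```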
Circularity Check
No circularity: direct counterexample proof on convex Lipschitz class
Full rationale
The paper establishes non-convergence of Muon via an explicit counterexample construction on the convex Lipschitz function class, independent of any fitted parameters, self-referential definitions, or load-bearing self-citations. The central claim follows from the optimizer update rule and the properties of the chosen function, without reducing any derived quantity back to the inputs by construction. Error-feedback restoration is shown separately as a positive result on the same class. The derivation is self-contained as a mathematical negative result and does not invoke prior author work to force uniqueness or smuggle ansatzes.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the objective function is convex and Lipschitz continuous.
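Written out, the two standard properties behind this assumption, in our notation with Lipschitz constant G and norm ||.||:

```latex
% Convexity and G-Lipschitz continuity of the objective f (standard definitions):
\[
  f\bigl(\lambda x + (1-\lambda) y\bigr) \;\le\; \lambda f(x) + (1-\lambda) f(y)
  \quad \forall\, x, y,\ \lambda \in [0,1],
  \qquad
  | f(x) - f(y) | \;\le\; G\,\| x - y \|
  \quad \forall\, x, y .
\]
```

No smoothness (Lipschitz continuity of the gradient) is assumed, which is exactly the gap between this ledger and the smoothness-based analyses the abstract points to.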