Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach

Sebastian Kassing; Thomas Kruse

arxiv: 2605.14663 · v1 · pith:7KNII4V5new · submitted 2026-05-14 · 🧮 math.OC · math.PR· stat.ML

Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach

Sebastian Kassing , Thomas Kruse This is my paper

Pith reviewed 2026-06-30 20:23 UTC · model grok-4.3

classification 🧮 math.OC math.PRstat.ML

keywords local Polyak-Lojasiewicz inequalitygradient descentstochastic gradient descentasymptotic convergence ratesnon-convex optimizationmultiplicative noisegeometric interpretation

0 comments

The pith

Under local PL with multiplicative noise, (S)GD matches the asymptotic rate of strongly convex quadratics

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gradient descent and stochastic gradient descent achieve the same asymptotic convergence rates on C² functions satisfying the local Polyak-Lojasiewicz inequality as they do on strongly convex quadratics. This equivalence holds even though the functions may be non-convex globally, provided the gradient noise obeys a multiplicative model. The authors reach this conclusion through a geometric reading of the PL condition. A sympathetic reader would care because the result indicates that local PL behavior alone can deliver the optimal rates that are usually derived only under global strong convexity.

Core claim

Using a geometric interpretation of the PL-condition, the authors prove that in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.

What carries the argument

Geometric interpretation of the local Polyak-Lojasiewicz inequality under the multiplicative gradient noise model

Load-bearing premise

The function must be C² smooth, satisfy the local PL inequality, and the gradient noise must be multiplicative.

What would settle it

Observe or simulate SGD on a C² function obeying local PL and multiplicative noise and find that its asymptotic rate is slower than the known quadratic rate.

read the original abstract

Stochastic gradient descent (SGD) has been studied extensively over the past decades due to its simplicity and broad applicability in machine learning. In this work, we analyze the local behavior of gradient descent and stochastic gradient descent for minimizing $C^2$-functions that satisfy the Polyak-Lojasiewicz (PL) inequality and under a multiplicative gradient noise model motivated by overparameterized neural networks. Using a geometric interpretation of the PL-condition, we prove a simple yet surprising fact: in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Local PL with multiplicative noise lets (S)GD match the strongly convex quadratic asymptotic rate via a geometric reduction.

read the letter

The core result is that under C2 smoothness, a local PL inequality, and multiplicative gradient noise, both GD and SGD achieve the same asymptotic linear rate as in the strongly convex quadratic case. The argument rests on a geometric reading of PL that turns the local dynamics into a contraction once inside the region.

The geometric step is the main contribution. It gives a direct way to transfer the classical rate without extra assumptions or heavy machinery. The multiplicative noise model is handled cleanly so the stochastic term vanishes at the same pace as the gradient, preserving the deterministic contraction asymptotically. This matches the motivation from overparameterized networks.

The paper states its scope plainly: the claim is local and asymptotic, and the noise must be multiplicative. With additive noise the equivalence would fail because the perturbation stays order one while the gradient shrinks. That dependence is load-bearing but is not hidden. The C2 and local-PL conditions are standard and sufficient for the reduction they use.

A minor limitation is that reaching the local PL region still needs separate argument in any global analysis, but the paper does not claim to solve that. The result is narrow but precise.

This is for readers working on optimization rates in machine learning who already accept local PL as a modeling assumption. It offers a shortcut for asymptotic analysis that could be cited when the noise model fits. The work is coherent on its own terms and deserves a serious referee.

Referee Report

1 major / 0 minor

Summary. The paper claims that for C² functions satisfying the local Polyak-Łojasiewicz inequality under a multiplicative gradient noise model (motivated by overparameterized networks), the asymptotic convergence rates of GD and SGD match the linear rates known for strongly convex quadratics. The proof relies on a geometric interpretation of the PL condition that reduces local dynamics to a linear contraction once inside the PL region.

Significance. If correct, the result supplies a clean geometric unification of local convergence rates across convex and non-convex regimes, with direct relevance to optimization landscapes in machine learning. The explicit multiplicative-noise assumption and geometric reduction are strengths that make the rate-equivalence claim falsifiable and potentially useful for algorithm design.

major comments (1)

[noise-model assumption and local PL analysis] The multiplicative noise model (variance scaling with ||∇f||) is load-bearing for the claimed rate match. Under additive noise the stochastic perturbation remains O(1) while the gradient term vanishes, preventing inheritance of the deterministic linear rate. The manuscript must therefore contain an explicit step (in the local-analysis section following the geometric reduction) showing that the noise term vanishes at the same rate as the deterministic term inside the PL region; without this step the equivalence to the quadratic case does not follow from C² + local PL alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the paper's significance and for the constructive major comment. We address it point-by-point below.

read point-by-point responses

Referee: [noise-model assumption and local PL analysis] The multiplicative noise model (variance scaling with ||∇f||) is load-bearing for the claimed rate match. Under additive noise the stochastic perturbation remains O(1) while the gradient term vanishes, preventing inheritance of the deterministic linear rate. The manuscript must therefore contain an explicit step (in the local-analysis section following the geometric reduction) showing that the noise term vanishes at the same rate as the deterministic term inside the PL region; without this step the equivalence to the quadratic case does not follow from C² + local PL alone.

Authors: We agree that the multiplicative structure is essential for inheriting the linear rate and that an explicit verification is needed. In the local-analysis section (following the geometric reduction to the PL region), the proof already contains this step: because the noise variance scales with ||∇f(x)||² and the local PL inequality gives ||∇f(x)||² ≥ μ(f(x)−f*), the stochastic term is bounded by a quantity that contracts at the same linear rate as the deterministic gradient term. This appears in the derivation of the one-step contraction for E[f(x_{k+1})] (immediately after the geometric reduction is invoked). To address the referee’s request for greater explicitness, we will insert a short dedicated paragraph and a displayed inequality that isolates the vanishing of the noise term. revision: partial

Circularity Check

0 steps flagged

No significant circularity; direct geometric derivation from stated assumptions

full rationale

The paper derives the asymptotic rate equivalence via a geometric interpretation of the local PL inequality under C² smoothness and multiplicative noise. No quoted step reduces a claimed prediction to a fitted parameter, self-definition, or load-bearing self-citation chain. The central result follows from the explicit assumptions (local PL region + multiplicative noise vanishing with the gradient) without tautological reduction to inputs. This is the expected non-circular outcome for a self-contained proof paper.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The result rests on three domain assumptions stated in the abstract: C² smoothness, local validity of the PL inequality, and the multiplicative noise model motivated by neural networks. No free parameters or invented entities are mentioned.

axioms (3)

domain assumption Functions are twice continuously differentiable (C²)
Required for the local analysis and geometric interpretation of PL.
domain assumption The Polyak-Lojasiewicz inequality holds locally
Central modeling choice enabling the rate comparison.
domain assumption Gradient noise follows a multiplicative model
Motivated by overparameterized neural networks; used to model stochasticity.

pith-pipeline@v0.9.1-grok · 5634 in / 1312 out tokens · 28571 ms · 2026-06-30T20:23:26.147813+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 4 canonical work pages

[1]

Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

Pierre-Antoine Absil, Robert Mahony, and Ben Andrews. Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

2005
[2]

Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods

Hedy Attouch, J´ erˆ ome Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013

2013
[3]

Dynamics of stochastic approximation algorithms

Michel Bena¨ ım. Dynamics of stochastic approximation algorithms. InSeminaire de probabilites XXXIII, pages 1–68. Springer, 2006

2006
[4]

Optimization methods for large-scale machine learning

L´ eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018

2018
[5]

If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025

Chris Criscitiello, Quentin Rebjock, and Nicolas Boumal. If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025

2025
[6]

Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron

Steffen Dereich and Sebastian Kassing. Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron. J. Probab., 28:1–48, 2023

2023
[7]

Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J

Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J. Mach. Learn., 3(3):245–281, 2024

2024
[8]

On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc

Paul Feehan. On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc. Var. Partial Differential Equations, 59(2):87, 2020

2020
[9]

Fehrman, B

B. Fehrman, B. Gess, and A. Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions.J. Mach. Learn. Res., 21(136):1–48, 2020

2020
[10]

Optimal local linear convergence of nesterov’s accelerated gradient method forc 2 functions under the Polyak– Lojasiewicz inequality.arXiv preprint arXiv:2603.21516, 2026

Zixu Feng and Hao Yuan. Optimal local linear convergence of nesterov’s accelerated gradient method forc 2 functions under the Polyak– Lojasiewicz inequality.arXiv preprint arXiv:2603.21516, 2026

work page arXiv 2026
[11]

arXiv preprint arXiv:2301.11235 , year=

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

work page arXiv 2023
[12]

Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026

Benjamin Gess and Sebastian Kassing. Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026

2026
[13]

Nesterov acceleration in benignly non-convex landscapes

Kanan Gupta and Stephan Wojtowytsch. Nesterov acceleration in benignly non-convex landscapes. InInter- national Conference on Learning Representations, volume 2025, pages 18741–18773, 2025

2025
[14]

Karimi, J

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods un- der the Polyak- Lojasiewicz condition. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016

2016
[15]

Polyak’s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality.arXiv preprint arXiv:2410.16849, 2024

Sebastian Kassing and Simon Weissmann. Polyak’s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality.arXiv preprint arXiv:2410.16849, 2024

work page arXiv 2024
[16]

Khaled and P

A. Khaled and P. Richt´ arik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

2023
[17]

Springer, 2003

Harold J Kushner and G George Yin.Stochastic approximation and recursive algorithms and applications. Springer, 2003

2003
[18]

Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023

Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023

2023
[19]

Springer Science & Business Media, 1992

Lennart Ljung, Georg Ch Pflug, and Harro Walk.Stochastic approximation and optimization of random systems, volume 17. Springer Science & Business Media, 1992

1992
[20]

Lojasiewicz

S. Lojasiewicz. Une propri´ et´ e topologique des sous-ensembles analytiques r´ eels.Les ´ equations aux d´ eriv´ ees partielles, 117:87–89, 1963

1963
[21]

The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018

2018
[22]

On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020

Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020

2020
[23]

Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions

Aziz Ben Nejma. Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions. arXiv preprint arXiv:2512.05285, 2025

work page arXiv 2025
[24]

Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018

Peter Ochs. Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018. OPTIMAL ASYMPTOTIC RATES FOR (S)GD UNDER THE LOCAL PL-CONDITION 21

2018
[25]

B. T. Polyak. Gradient methods for the minimisation of functionals.U.S.S.R. Comput. Math. and Math. Phys., 3(4):864–878, 1963

1963
[26]

B. T. Polyak. Some methods of speeding up the convergence of iteration methods.U.S.S.R. Comput. Math. and Math. Phys., 4(5):1–17, 1964

1964
[27]

Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math

Quentin Rebjock and Nicolas Boumal. Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math. Program., 213(1):151–199, 2025

2025
[28]

A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951

1951
[29]

Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015

Vladislav B Tadi´ c. Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015

2015
[30]

Vaswani, F

S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019

2019
[31]

Wojtowytsch

S. Wojtowytsch. Stochastic gradient descent with noise of machine learning type part I: Discrete time analysis. J. Nonlinear Sci., 33(3):45, 2023. Sebastian Kassing, Department of Mathematics & Informatics, University of Wuppertal, 42119 Wuppertal, Germany Email address:kassing@uni-wuppertal.de Thomas Kruse, Department of Mathematics & Informatics, Univer...

2023

[1] [1]

Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

Pierre-Antoine Absil, Robert Mahony, and Ben Andrews. Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

2005

[2] [2]

Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods

Hedy Attouch, J´ erˆ ome Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013

2013

[3] [3]

Dynamics of stochastic approximation algorithms

Michel Bena¨ ım. Dynamics of stochastic approximation algorithms. InSeminaire de probabilites XXXIII, pages 1–68. Springer, 2006

2006

[4] [4]

Optimization methods for large-scale machine learning

L´ eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018

2018

[5] [5]

If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025

Chris Criscitiello, Quentin Rebjock, and Nicolas Boumal. If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025

2025

[6] [6]

Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron

Steffen Dereich and Sebastian Kassing. Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron. J. Probab., 28:1–48, 2023

2023

[7] [7]

Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J

Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J. Mach. Learn., 3(3):245–281, 2024

2024

[8] [8]

On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc

Paul Feehan. On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc. Var. Partial Differential Equations, 59(2):87, 2020

2020

[9] [9]

Fehrman, B

B. Fehrman, B. Gess, and A. Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions.J. Mach. Learn. Res., 21(136):1–48, 2020

2020

[10] [10]

Optimal local linear convergence of nesterov’s accelerated gradient method forc 2 functions under the Polyak– Lojasiewicz inequality.arXiv preprint arXiv:2603.21516, 2026

Zixu Feng and Hao Yuan. Optimal local linear convergence of nesterov’s accelerated gradient method forc 2 functions under the Polyak– Lojasiewicz inequality.arXiv preprint arXiv:2603.21516, 2026

work page arXiv 2026

[11] [11]

arXiv preprint arXiv:2301.11235 , year=

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

work page arXiv 2023

[12] [12]

Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026

Benjamin Gess and Sebastian Kassing. Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026

2026

[13] [13]

Nesterov acceleration in benignly non-convex landscapes

Kanan Gupta and Stephan Wojtowytsch. Nesterov acceleration in benignly non-convex landscapes. InInter- national Conference on Learning Representations, volume 2025, pages 18741–18773, 2025

2025

[14] [14]

Karimi, J

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods un- der the Polyak- Lojasiewicz condition. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016

2016

[15] [15]

Polyak’s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality.arXiv preprint arXiv:2410.16849, 2024

Sebastian Kassing and Simon Weissmann. Polyak’s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality.arXiv preprint arXiv:2410.16849, 2024

work page arXiv 2024

[16] [16]

Khaled and P

A. Khaled and P. Richt´ arik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

2023

[17] [17]

Springer, 2003

Harold J Kushner and G George Yin.Stochastic approximation and recursive algorithms and applications. Springer, 2003

2003

[18] [18]

Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023

Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023

2023

[19] [19]

Springer Science & Business Media, 1992

Lennart Ljung, Georg Ch Pflug, and Harro Walk.Stochastic approximation and optimization of random systems, volume 17. Springer Science & Business Media, 1992

1992

[20] [20]

Lojasiewicz

S. Lojasiewicz. Une propri´ et´ e topologique des sous-ensembles analytiques r´ eels.Les ´ equations aux d´ eriv´ ees partielles, 117:87–89, 1963

1963

[21] [21]

The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018

2018

[22] [22]

On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020

Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020

2020

[23] [23]

Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions

Aziz Ben Nejma. Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions. arXiv preprint arXiv:2512.05285, 2025

work page arXiv 2025

[24] [24]

Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018

Peter Ochs. Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018. OPTIMAL ASYMPTOTIC RATES FOR (S)GD UNDER THE LOCAL PL-CONDITION 21

2018

[25] [25]

B. T. Polyak. Gradient methods for the minimisation of functionals.U.S.S.R. Comput. Math. and Math. Phys., 3(4):864–878, 1963

1963

[26] [26]

B. T. Polyak. Some methods of speeding up the convergence of iteration methods.U.S.S.R. Comput. Math. and Math. Phys., 4(5):1–17, 1964

1964

[27] [27]

Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math

Quentin Rebjock and Nicolas Boumal. Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math. Program., 213(1):151–199, 2025

2025

[28] [28]

A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951

1951

[29] [29]

Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015

Vladislav B Tadi´ c. Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015

2015

[30] [30]

Vaswani, F

S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019

2019

[31] [31]

Wojtowytsch

S. Wojtowytsch. Stochastic gradient descent with noise of machine learning type part I: Discrete time analysis. J. Nonlinear Sci., 33(3):45, 2023. Sebastian Kassing, Department of Mathematics & Informatics, University of Wuppertal, 42119 Wuppertal, Germany Email address:kassing@uni-wuppertal.de Thomas Kruse, Department of Mathematics & Informatics, Univer...

2023