Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach
Pith reviewed 2026-06-30 20:23 UTC · model grok-4.3
The pith
Under local PL with multiplicative noise, (S)GD matches the asymptotic rate of strongly convex quadratics
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a geometric interpretation of the PL-condition, the authors prove that in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.
What carries the argument
Geometric interpretation of the local Polyak-Lojasiewicz inequality under the multiplicative gradient noise model
Load-bearing premise
The function must be C² smooth, satisfy the local PL inequality, and the gradient noise must be multiplicative.
What would settle it
Observe or simulate SGD on a C² function obeying local PL and multiplicative noise and find that its asymptotic rate is slower than the known quadratic rate.
read the original abstract
Stochastic gradient descent (SGD) has been studied extensively over the past decades due to its simplicity and broad applicability in machine learning. In this work, we analyze the local behavior of gradient descent and stochastic gradient descent for minimizing $C^2$-functions that satisfy the Polyak-Lojasiewicz (PL) inequality and under a multiplicative gradient noise model motivated by overparameterized neural networks. Using a geometric interpretation of the PL-condition, we prove a simple yet surprising fact: in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for C² functions satisfying the local Polyak-Łojasiewicz inequality under a multiplicative gradient noise model (motivated by overparameterized networks), the asymptotic convergence rates of GD and SGD match the linear rates known for strongly convex quadratics. The proof relies on a geometric interpretation of the PL condition that reduces local dynamics to a linear contraction once inside the PL region.
Significance. If correct, the result supplies a clean geometric unification of local convergence rates across convex and non-convex regimes, with direct relevance to optimization landscapes in machine learning. The explicit multiplicative-noise assumption and geometric reduction are strengths that make the rate-equivalence claim falsifiable and potentially useful for algorithm design.
major comments (1)
- [noise-model assumption and local PL analysis] The multiplicative noise model (variance scaling with ||∇f||) is load-bearing for the claimed rate match. Under additive noise the stochastic perturbation remains O(1) while the gradient term vanishes, preventing inheritance of the deterministic linear rate. The manuscript must therefore contain an explicit step (in the local-analysis section following the geometric reduction) showing that the noise term vanishes at the same rate as the deterministic term inside the PL region; without this step the equivalence to the quadratic case does not follow from C² + local PL alone.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the paper's significance and for the constructive major comment. We address it point-by-point below.
read point-by-point responses
-
Referee: [noise-model assumption and local PL analysis] The multiplicative noise model (variance scaling with ||∇f||) is load-bearing for the claimed rate match. Under additive noise the stochastic perturbation remains O(1) while the gradient term vanishes, preventing inheritance of the deterministic linear rate. The manuscript must therefore contain an explicit step (in the local-analysis section following the geometric reduction) showing that the noise term vanishes at the same rate as the deterministic term inside the PL region; without this step the equivalence to the quadratic case does not follow from C² + local PL alone.
Authors: We agree that the multiplicative structure is essential for inheriting the linear rate and that an explicit verification is needed. In the local-analysis section (following the geometric reduction to the PL region), the proof already contains this step: because the noise variance scales with ||∇f(x)||² and the local PL inequality gives ||∇f(x)||² ≥ μ(f(x)−f*), the stochastic term is bounded by a quantity that contracts at the same linear rate as the deterministic gradient term. This appears in the derivation of the one-step contraction for E[f(x_{k+1})] (immediately after the geometric reduction is invoked). To address the referee’s request for greater explicitness, we will insert a short dedicated paragraph and a displayed inequality that isolates the vanishing of the noise term. revision: partial
Circularity Check
No significant circularity; direct geometric derivation from stated assumptions
full rationale
The paper derives the asymptotic rate equivalence via a geometric interpretation of the local PL inequality under C² smoothness and multiplicative noise. No quoted step reduces a claimed prediction to a fitted parameter, self-definition, or load-bearing self-citation chain. The central result follows from the explicit assumptions (local PL region + multiplicative noise vanishing with the gradient) without tautological reduction to inputs. This is the expected non-circular outcome for a self-contained proof paper.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption Functions are twice continuously differentiable (C²)
- domain assumption The Polyak-Lojasiewicz inequality holds locally
- domain assumption Gradient noise follows a multiplicative model
Reference graph
Works this paper leans on
-
[1]
Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005
Pierre-Antoine Absil, Robert Mahony, and Ben Andrews. Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005
2005
-
[2]
Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods
Hedy Attouch, J´ erˆ ome Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013
2013
-
[3]
Dynamics of stochastic approximation algorithms
Michel Bena¨ ım. Dynamics of stochastic approximation algorithms. InSeminaire de probabilites XXXIII, pages 1–68. Springer, 2006
2006
-
[4]
Optimization methods for large-scale machine learning
L´ eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018
2018
-
[5]
If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025
Chris Criscitiello, Quentin Rebjock, and Nicolas Boumal. If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025
2025
-
[6]
Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron
Steffen Dereich and Sebastian Kassing. Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron. J. Probab., 28:1–48, 2023
2023
-
[7]
Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J
Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J. Mach. Learn., 3(3):245–281, 2024
2024
-
[8]
On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc
Paul Feehan. On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc. Var. Partial Differential Equations, 59(2):87, 2020
2020
-
[9]
Fehrman, B
B. Fehrman, B. Gess, and A. Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions.J. Mach. Learn. Res., 21(136):1–48, 2020
2020
-
[10]
Zixu Feng and Hao Yuan. Optimal local linear convergence of nesterov’s accelerated gradient method forc 2 functions under the Polyak– Lojasiewicz inequality.arXiv preprint arXiv:2603.21516, 2026
-
[11]
arXiv preprint arXiv:2301.11235 , year=
Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023
-
[12]
Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026
Benjamin Gess and Sebastian Kassing. Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026
2026
-
[13]
Nesterov acceleration in benignly non-convex landscapes
Kanan Gupta and Stephan Wojtowytsch. Nesterov acceleration in benignly non-convex landscapes. InInter- national Conference on Learning Representations, volume 2025, pages 18741–18773, 2025
2025
-
[14]
Karimi, J
H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods un- der the Polyak- Lojasiewicz condition. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016
2016
-
[15]
Sebastian Kassing and Simon Weissmann. Polyak’s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality.arXiv preprint arXiv:2410.16849, 2024
-
[16]
Khaled and P
A. Khaled and P. Richt´ arik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023
2023
-
[17]
Springer, 2003
Harold J Kushner and G George Yin.Stochastic approximation and recursive algorithms and applications. Springer, 2003
2003
-
[18]
Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023
Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023
2023
-
[19]
Springer Science & Business Media, 1992
Lennart Ljung, Georg Ch Pflug, and Harro Walk.Stochastic approximation and optimization of random systems, volume 17. Springer Science & Business Media, 1992
1992
-
[20]
Lojasiewicz
S. Lojasiewicz. Une propri´ et´ e topologique des sous-ensembles analytiques r´ eels.Les ´ equations aux d´ eriv´ ees partielles, 117:87–89, 1963
1963
-
[21]
The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning
Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018
2018
-
[22]
On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020
Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020
2020
-
[23]
Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions
Aziz Ben Nejma. Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions. arXiv preprint arXiv:2512.05285, 2025
-
[24]
Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018
Peter Ochs. Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018. OPTIMAL ASYMPTOTIC RATES FOR (S)GD UNDER THE LOCAL PL-CONDITION 21
2018
-
[25]
B. T. Polyak. Gradient methods for the minimisation of functionals.U.S.S.R. Comput. Math. and Math. Phys., 3(4):864–878, 1963
1963
-
[26]
B. T. Polyak. Some methods of speeding up the convergence of iteration methods.U.S.S.R. Comput. Math. and Math. Phys., 4(5):1–17, 1964
1964
-
[27]
Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math
Quentin Rebjock and Nicolas Boumal. Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math. Program., 213(1):151–199, 2025
2025
-
[28]
A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951
Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951
1951
-
[29]
Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015
Vladislav B Tadi´ c. Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015
2015
-
[30]
Vaswani, F
S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019
2019
-
[31]
Wojtowytsch
S. Wojtowytsch. Stochastic gradient descent with noise of machine learning type part I: Discrete time analysis. J. Nonlinear Sci., 33(3):45, 2023. Sebastian Kassing, Department of Mathematics & Informatics, University of Wuppertal, 42119 Wuppertal, Germany Email address:kassing@uni-wuppertal.de Thomas Kruse, Department of Mathematics & Informatics, Univer...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.