pith. sign in

arxiv: 2605.14663 · v1 · pith:7KNII4V5new · submitted 2026-05-14 · 🧮 math.OC · math.PR· stat.ML

Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach

Pith reviewed 2026-06-30 20:23 UTC · model grok-4.3

classification 🧮 math.OC math.PRstat.ML
keywords local Polyak-Lojasiewicz inequalitygradient descentstochastic gradient descentasymptotic convergence ratesnon-convex optimizationmultiplicative noisegeometric interpretation
0
0 comments X

The pith

Under local PL with multiplicative noise, (S)GD matches the asymptotic rate of strongly convex quadratics

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gradient descent and stochastic gradient descent achieve the same asymptotic convergence rates on C² functions satisfying the local Polyak-Lojasiewicz inequality as they do on strongly convex quadratics. This equivalence holds even though the functions may be non-convex globally, provided the gradient noise obeys a multiplicative model. The authors reach this conclusion through a geometric reading of the PL condition. A sympathetic reader would care because the result indicates that local PL behavior alone can deliver the optimal rates that are usually derived only under global strong convexity.

Core claim

Using a geometric interpretation of the PL-condition, the authors prove that in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.

What carries the argument

Geometric interpretation of the local Polyak-Lojasiewicz inequality under the multiplicative gradient noise model

Load-bearing premise

The function must be C² smooth, satisfy the local PL inequality, and the gradient noise must be multiplicative.

What would settle it

Observe or simulate SGD on a C² function obeying local PL and multiplicative noise and find that its asymptotic rate is slower than the known quadratic rate.

read the original abstract

Stochastic gradient descent (SGD) has been studied extensively over the past decades due to its simplicity and broad applicability in machine learning. In this work, we analyze the local behavior of gradient descent and stochastic gradient descent for minimizing $C^2$-functions that satisfy the Polyak-Lojasiewicz (PL) inequality and under a multiplicative gradient noise model motivated by overparameterized neural networks. Using a geometric interpretation of the PL-condition, we prove a simple yet surprising fact: in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that for C² functions satisfying the local Polyak-Łojasiewicz inequality under a multiplicative gradient noise model (motivated by overparameterized networks), the asymptotic convergence rates of GD and SGD match the linear rates known for strongly convex quadratics. The proof relies on a geometric interpretation of the PL condition that reduces local dynamics to a linear contraction once inside the PL region.

Significance. If correct, the result supplies a clean geometric unification of local convergence rates across convex and non-convex regimes, with direct relevance to optimization landscapes in machine learning. The explicit multiplicative-noise assumption and geometric reduction are strengths that make the rate-equivalence claim falsifiable and potentially useful for algorithm design.

major comments (1)
  1. [noise-model assumption and local PL analysis] The multiplicative noise model (variance scaling with ||∇f||) is load-bearing for the claimed rate match. Under additive noise the stochastic perturbation remains O(1) while the gradient term vanishes, preventing inheritance of the deterministic linear rate. The manuscript must therefore contain an explicit step (in the local-analysis section following the geometric reduction) showing that the noise term vanishes at the same rate as the deterministic term inside the PL region; without this step the equivalence to the quadratic case does not follow from C² + local PL alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the paper's significance and for the constructive major comment. We address it point-by-point below.

read point-by-point responses
  1. Referee: [noise-model assumption and local PL analysis] The multiplicative noise model (variance scaling with ||∇f||) is load-bearing for the claimed rate match. Under additive noise the stochastic perturbation remains O(1) while the gradient term vanishes, preventing inheritance of the deterministic linear rate. The manuscript must therefore contain an explicit step (in the local-analysis section following the geometric reduction) showing that the noise term vanishes at the same rate as the deterministic term inside the PL region; without this step the equivalence to the quadratic case does not follow from C² + local PL alone.

    Authors: We agree that the multiplicative structure is essential for inheriting the linear rate and that an explicit verification is needed. In the local-analysis section (following the geometric reduction to the PL region), the proof already contains this step: because the noise variance scales with ||∇f(x)||² and the local PL inequality gives ||∇f(x)||² ≥ μ(f(x)−f*), the stochastic term is bounded by a quantity that contracts at the same linear rate as the deterministic gradient term. This appears in the derivation of the one-step contraction for E[f(x_{k+1})] (immediately after the geometric reduction is invoked). To address the referee’s request for greater explicitness, we will insert a short dedicated paragraph and a displayed inequality that isolates the vanishing of the noise term. revision: partial

Circularity Check

0 steps flagged

No significant circularity; direct geometric derivation from stated assumptions

full rationale

The paper derives the asymptotic rate equivalence via a geometric interpretation of the local PL inequality under C² smoothness and multiplicative noise. No quoted step reduces a claimed prediction to a fitted parameter, self-definition, or load-bearing self-citation chain. The central result follows from the explicit assumptions (local PL region + multiplicative noise vanishing with the gradient) without tautological reduction to inputs. This is the expected non-circular outcome for a self-contained proof paper.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The result rests on three domain assumptions stated in the abstract: C² smoothness, local validity of the PL inequality, and the multiplicative noise model motivated by neural networks. No free parameters or invented entities are mentioned.

axioms (3)
  • domain assumption Functions are twice continuously differentiable (C²)
    Required for the local analysis and geometric interpretation of PL.
  • domain assumption The Polyak-Lojasiewicz inequality holds locally
    Central modeling choice enabling the rate comparison.
  • domain assumption Gradient noise follows a multiplicative model
    Motivated by overparameterized neural networks; used to model stochasticity.

pith-pipeline@v0.9.1-grok · 5634 in / 1312 out tokens · 28571 ms · 2026-06-30T20:23:26.147813+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 4 canonical work pages

  1. [1]

    Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

    Pierre-Antoine Absil, Robert Mahony, and Ben Andrews. Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

  2. [2]

    Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods

    Hedy Attouch, J´ erˆ ome Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013

  3. [3]

    Dynamics of stochastic approximation algorithms

    Michel Bena¨ ım. Dynamics of stochastic approximation algorithms. InSeminaire de probabilites XXXIII, pages 1–68. Springer, 2006

  4. [4]

    Optimization methods for large-scale machine learning

    L´ eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018

  5. [5]

    If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025

    Chris Criscitiello, Quentin Rebjock, and Nicolas Boumal. If a smooth function is globally P L and coercive, then it has a unique minimizer.https://www.racetothebottom.xyz/posts/PL-smooth-unique/, 2025

  6. [6]

    Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron

    Steffen Dereich and Sebastian Kassing. Central limit theorems for stochastic gradient descent with averaging for stable manifolds.Electron. J. Probab., 28:1–48, 2023

  7. [7]

    Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J

    Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Lojasiewicz- landscapes.J. Mach. Learn., 3(3):245–281, 2024

  8. [8]

    On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc

    Paul Feehan. On the Morse–Bott property of analytic functions on banach spaces with Lojasiewicz exponent one half.Calc. Var. Partial Differential Equations, 59(2):87, 2020

  9. [9]

    Fehrman, B

    B. Fehrman, B. Gess, and A. Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions.J. Mach. Learn. Res., 21(136):1–48, 2020

  10. [10]

    Optimal local linear convergence of nesterov’s accelerated gradient method forc 2 functions under the Polyak– Lojasiewicz inequality.arXiv preprint arXiv:2603.21516, 2026

    Zixu Feng and Hao Yuan. Optimal local linear convergence of nesterov’s accelerated gradient method forc 2 functions under the Polyak– Lojasiewicz inequality.arXiv preprint arXiv:2603.21516, 2026

  11. [11]

    arXiv preprint arXiv:2301.11235 , year=

    Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

  12. [12]

    Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026

    Benjamin Gess and Sebastian Kassing. Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting.Mathematical Programming, pages 1–37, 2026

  13. [13]

    Nesterov acceleration in benignly non-convex landscapes

    Kanan Gupta and Stephan Wojtowytsch. Nesterov acceleration in benignly non-convex landscapes. InInter- national Conference on Learning Representations, volume 2025, pages 18741–18773, 2025

  14. [14]

    Karimi, J

    H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods un- der the Polyak- Lojasiewicz condition. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016

  15. [15]

    Polyak’s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality.arXiv preprint arXiv:2410.16849, 2024

    Sebastian Kassing and Simon Weissmann. Polyak’s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality.arXiv preprint arXiv:2410.16849, 2024

  16. [16]

    Khaled and P

    A. Khaled and P. Richt´ arik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

  17. [17]

    Springer, 2003

    Harold J Kushner and G George Yin.Stochastic approximation and recursive algorithms and applications. Springer, 2003

  18. [18]

    Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023

    Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the mini- mizers: fast convergence of SGD for overparametrized problems.Advances in neural information processing systems, 36:60748–60767, 2023

  19. [19]

    Springer Science & Business Media, 1992

    Lennart Ljung, Georg Ch Pflug, and Harro Walk.Stochastic approximation and optimization of random systems, volume 17. Springer Science & Business Media, 1992

  20. [20]

    Lojasiewicz

    S. Lojasiewicz. Une propri´ et´ e topologique des sous-ensembles analytiques r´ eels.Les ´ equations aux d´ eriv´ ees partielles, 117:87–89, 1963

  21. [21]

    The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018

  22. [22]

    On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020

    Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems.Advances in Neural Information Processing Systems, 33:1117–1128, 2020

  23. [23]

    Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions

    Aziz Ben Nejma. Polyak- Lojasiewicz inequality is essentially no more general than strong convexity forC 2 functions. arXiv preprint arXiv:2512.05285, 2025

  24. [24]

    Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018

    Peter Ochs. Local convergence of the Heavy-Ball method and iPiano for Non-convex optimization.Journal of Optimization Theory and Applications, 177:153–180, 2018. OPTIMAL ASYMPTOTIC RATES FOR (S)GD UNDER THE LOCAL PL-CONDITION 21

  25. [25]

    B. T. Polyak. Gradient methods for the minimisation of functionals.U.S.S.R. Comput. Math. and Math. Phys., 3(4):864–878, 1963

  26. [26]

    B. T. Polyak. Some methods of speeding up the convergence of iteration methods.U.S.S.R. Comput. Math. and Math. Phys., 4(5):1–17, 1964

  27. [27]

    Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math

    Quentin Rebjock and Nicolas Boumal. Fast convergence to non-isolated minima: four equivalent conditions forC 2 functions.Math. Program., 213(1):151–199, 2025

  28. [28]

    A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical sta- tistics, pages 400–407, 1951

  29. [29]

    Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015

    Vladislav B Tadi´ c. Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema.Stochastic Processes and their Applications, 125(5):1715–1755, 2015

  30. [30]

    Vaswani, F

    S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019

  31. [31]

    Wojtowytsch

    S. Wojtowytsch. Stochastic gradient descent with noise of machine learning type part I: Discrete time analysis. J. Nonlinear Sci., 33(3):45, 2023. Sebastian Kassing, Department of Mathematics & Informatics, University of Wuppertal, 42119 Wuppertal, Germany Email address:kassing@uni-wuppertal.de Thomas Kruse, Department of Mathematics & Informatics, Univer...