Stochastic algorithms with geometric step decay converge linearly on sharp functions
Pith reviewed 2026-05-24 17:50 UTC · model grok-4.3
The pith
Geometric step decay endows stochastic algorithms with local linear convergence on sharp nonconvex problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows the stochastic projected subgradient, proximal point, and prox-linear algorithms with a local linear rate of convergence to global minimizers.
What carries the argument
Geometric step decay schedule that halves the step size after every few epochs.
If this is right
- The stochastic projected subgradient method converges linearly to global minimizers on sharp nonconvex problems.
- The proximal point and prox-linear algorithms inherit the same local linear rate under geometric decay.
- Phase retrieval and blind deconvolution obtain matching or improved guarantees under both Gaussian and heavy-tailed measurements.
Where Pith is reading between the lines
- The local character of the result implies that a preliminary global phase may be needed to enter the basin before decay begins.
- The same schedule could simplify step-size selection in other nonconvex statistical estimation problems that satisfy sharpness.
- Heavy-tailed guarantees suggest the approach may tolerate heavier noise than typical analyses assume.
Load-bearing premise
Iterates remain inside a local neighborhood of the minimizer with high probability and do not escape.
What would settle it
A concrete sharp nonconvex instance in which the algorithm with geometric decay escapes the basin and the observed rate becomes sublinear.
Figures
read the original abstract
Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called \emph{geometric step decay} and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of \emph{sharp} convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks---phase retrieval and blind deconvolution---and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that geometric step decay schedules allow stochastic projected subgradient, proximal point, and prox-linear algorithms to achieve local linear convergence to global minimizers on a broad class of sharp, nonsmooth, nonconvex problems. The analysis is applied to phase retrieval and blind deconvolution, recovering optimal rates under Gaussian measurements and providing new rates under heavy-tailed noise.
Significance. If the escape-probability control holds, the result meaningfully extends the geometric-decay linear-rate phenomenon from the convex sharp case to nonconvex settings that arise in statistical recovery. The local nature of the basin and the explicit handling of stochastic escape are the novel technical elements.
major comments (2)
- [Theorem 3.1 and surrounding discussion of the local basin] The central guarantee is local linear convergence inside a basin; the manuscript must therefore supply an explicit bound showing that the total probability of escape over the entire (infinite) horizon remains small (e.g., <1/2) even after repeated step-size halvings. No such quantitative escape-probability statement appears in the main theorem or its proof sketch.
- [§5.2 (heavy-tailed phase retrieval)] For the heavy-tailed application the noise model is weaker than sub-Gaussian; the supermartingale or union-bound argument used to control escape must be re-verified under only moment assumptions, yet the proof appears to reuse the same concentration inequalities derived for the Gaussian case.
minor comments (2)
- [§2] The definition of the sharpness constant and the radius of the local basin should be stated once in a single display equation rather than re-introduced in each theorem.
- [Figure 1] Figure 1 caption does not indicate whether the plotted trajectories are single runs or averaged; this affects interpretation of the observed linear phase.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying these two technical points that require clarification. Both comments concern the rigor of the escape-probability control, which is central to extending geometric decay to the nonconvex setting. We address each below and will incorporate the requested explicit statements in the revision.
read point-by-point responses
-
Referee: [Theorem 3.1 and surrounding discussion of the local basin] The central guarantee is local linear convergence inside a basin; the manuscript must therefore supply an explicit bound showing that the total probability of escape over the entire (infinite) horizon remains small (e.g., <1/2) even after repeated step-size halvings. No such quantitative escape-probability statement appears in the main theorem or its proof sketch.
Authors: We agree that an explicit quantitative bound on the cumulative escape probability strengthens the result. While the proof of Theorem 3.1 already controls the per-epoch escape probability via a union bound whose terms decrease geometrically with the step-size halvings, the infinite-horizon summation is only implicit. In the revised version we will insert a new corollary (Corollary 3.2) immediately after Theorem 3.1 that sums the geometric series of per-phase escape probabilities and shows that, for any prescribed δ > 0, the initial step size and epoch lengths can be chosen so that the total escape probability over the infinite horizon is at most δ (in particular < 1/2). The argument uses only the same per-epoch tail bound already present in the proof and the geometric decay of the step sizes. revision: yes
-
Referee: [§5.2 (heavy-tailed phase retrieval)] For the heavy-tailed application the noise model is weaker than sub-Gaussian; the supermartingale or union-bound argument used to control escape must be re-verified under only moment assumptions, yet the proof appears to reuse the same concentration inequalities derived for the Gaussian case.
Authors: We acknowledge the need for explicit verification under moment assumptions. The escape control in §5.2 is based on a supermartingale inequality that depends solely on the second-moment bound stated in Assumption 5.3; this inequality was originally derived for the Gaussian case but holds verbatim under the weaker moment condition. Nevertheless, to remove any ambiguity we will revise the proof text to cite only the moment-based supermartingale bound, add a short remark confirming that no sub-Gaussian tail inequalities are invoked, and verify that the same union-bound argument over epochs continues to apply. revision: yes
Circularity Check
No circularity: new local analysis for nonconvex sharp case stands independently of convex prior.
full rationale
The abstract and available text present a new result extending geometric step decay to stochastic sharp nonconvex problems, explicitly noting the additional challenge of bounding escape probability from the local basin. No equations, fitted parameters, or self-citations are shown reducing the claimed linear rate to a definition or prior input by construction. The convex sharp case is cited as prior work but is not load-bearing for the nonconvex extension, which requires fresh supermartingale or union-bound arguments under noise. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Objective functions satisfy a sharpness condition that guarantees linear growth away from minimizers.
- domain assumption Stochastic oracles satisfy bounded variance or moment conditions sufficient for concentration.
Forward citations
Cited by 1 Pith paper
-
Convergence of difference inclusions via a diameter criterion
A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.
Reference graph
Works this paper leans on
- [1]
- [2]
-
[3]
H. Asi and J.C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. arXiv preprint arXiv:1810.05633 , 2018
-
[4]
H. Asi and J.C. Duchi. The importance of better models in stochastic optimization. arXiv preprint arXiv:1903.08619 , 2019
- [5]
-
[6]
E.J. Cand` es, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recov- ery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013. 26
work page 2013
-
[7]
Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence
V. Charisopoulos, Y. Chen, D. Davis, M. D´ ıaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020 , 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[8]
Composite optimization for robust blind deconvolution
V. Charisopoulos, D. Davis, M. D´ ıaz, and D. Drusvyatskiy. Composite optimization for robust blind deconvolution. arXiv:1901.01624, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[9]
Geometric step decay: reference implementation
COR-OPT. Geometric step decay: reference implementation. https://github.com/ COR-OPT/GeomStepDecay, 2019
work page 2019
-
[10]
D. Davis. SMART: The stochastic monotone aggregated root-finding algorithm. arXiv preprint arXiv:1601.00698, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[11]
D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization , 29(1):207–239, 2019
work page 2019
- [12]
-
[13]
The nonsmooth landscape of phase retrieval
D. Davis, D. Drusvyatskiy, and C. Paquette. The nonsmooth landscape of phase re- trieval. To appear in IMA J. Numer. Anal., arXiv:1711.03247 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[14]
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27 , pages 1646–1654. Curran Associates, Inc., 2014
work page 2014
-
[15]
A. Defazio, J. Domke, and T.S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning , volume 32 of Proceedings of Machine Learning Research, pages 1125–1133, Bejing, China, 22–24 Jun
-
[16]
D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News , 26(1), 2018
work page 2018
-
[17]
Optimality, identifiability, and sensitivity
D. Drusvyatskiy and A. S. Lewis. Optimality, identifiablity, and sensitivity. arXiv preprint arXiv:1207.6628, 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[18]
J.C. Duchi and F. Ruan. Solving (most) of a set of quadratic equalities: com- posite optimization for robust phase retrieval. IMA J. Information and Inference, doi:10.1093/imaiai/iay015, 2018
-
[19]
J.C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex opti- mization problems. SIAM J. Optim. , 28(4):3229–3259, 2018
work page 2018
-
[20]
Y.C. Eldar and S. Mendelson. Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal. , 36(3):473–494, 2014. 27
work page 2014
-
[21]
I.I. Eremin. The relaxation method of solving systems of inequalities with convex func- tions on the left-hand side. Dokl. Akad. Nauk SSSR , 160:994–996, 1965
work page 1965
-
[22]
Restarting accelerated gradient methods with a rough strong convexity estimate
O. Fercoq and Z. Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[23]
O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. arXiv preprint arXiv:1709.02300 , 2017
-
[24]
Robert M. Freund and Haihao Lu. New computational guarantees for solving convex op- timization problems with first order methods, via a function growth condition measure. Mathematical Programming, 170(2):445–477, Aug 2018
work page 2018
- [25]
-
[26]
S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly con- vex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization , 23(4):2061–2089, 2013
work page 2061
-
[27]
J.L. Goffin. On convergence rates of subgradient optimization methods. Math. Pro- gramming, 13(3):329–347, 1977
work page 1977
-
[28]
T. Goldstein and C. Studer. Phasemax: Convex phase retrieval via basis pursuit. IEEE Transactions on Information Theory, 64(4):2675–2689, April 2018
work page 2018
-
[29]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016
work page 2016
- [30]
-
[31]
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural In- formation Processing Systems, NIPS’13, pages 315–323, USA, 2013. Curran Associates Inc
work page 2013
-
[32]
P.R. Johnstone and P. Moulin. Faster subgradient methods for functions with h¨ olderian growth. Mathematical Programming, Jan 2019
work page 2019
-
[33]
A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convo- lutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012
work page 2012
-
[34]
A. Kulunchakov and J. Mairal. A generic acceleration framework for stochastic com- posite optimization. arXiv preprint arXiv:1906.01164 , 2019. 28
-
[35]
H.J. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York) . Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability
work page 2003
- [36]
-
[37]
S. Lee and S.J. Wright. Manifold identification in dual averaging for regularized stochas- tic online learning. Journal of Machine Learning Research , 13(Jun):1705–1744, 2012
work page 2012
-
[38]
A.S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (electronic) (2003), 2002
work page 2003
-
[39]
X. Li, S. Ling, T. Strohmer, and K. Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis , 2018
work page 2018
-
[40]
Y. Li, C. Ma, Y. Chen, and Y. Chi. Nonconvex matrix factorization from rank-one measurements. arXiv:1802.06286, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[41]
J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. , 25(2):829–855, 2015
work page 2015
-
[42]
B. S. Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory. Grundlehren der mathematischen Wissenschaften, Vol 330, Springer, Berlin, 2006
work page 2006
-
[43]
A. Nedi´ c and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications , pages 223–264. Springer, 2001
work page 2001
-
[44]
A.S. Nemirovskii and Yu.E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics , 25(2):21 – 30, 1985
work page 1985
-
[45]
A.S. Nemirovsky and D.B. Yudin. Problem complexity and method efficiency in opti- mization. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York, 1983. Translated from the Russian and with a preface by E. R. Dawson, Wiley-Interscience Series in Discrete Mathematics
work page 1983
- [46]
- [47]
-
[48]
B. O’Donoghue and E. Cand` es. Adaptive restart for accelerated gradient schemes. Found. Comput. Math., 15(3):715–732, 2015
work page 2015
- [49]
-
[50]
J.-P. Penot. Calculus without derivatives, volume 266 of Graduate Texts in Mathematics. Springer, New York, 2013
work page 2013
-
[51]
R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348:1805–1838, 1996
work page 1996
-
[52]
B.T. Poljak. Minimization of nonsmooth functionals. ˇZ. Vyˇ cisl. Mat. i Mat. Fiz. , 9:509–521, 1969
work page 1969
-
[53]
B.T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. , 30(4):838–855, 1992
work page 1992
-
[54]
J. Renegar and B. Grimmer. A simple nearly-optimal restart scheme for speeding-up first order methods. arXiv preprint arXiv:1803.00151 , 2018
-
[55]
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics , 22:400–407, 1951
work page 1951
-
[56]
R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematis- chen Wissenschaften, Vol 317, Springer, Berlin, 1998
work page 1998
-
[57]
V. Roulet and A. d’Aspremont. Sharpness, restart and acceleration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems 30 , pages 1119–1129. Curran Associates, Inc., 2017
work page 2017
-
[58]
M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017
work page 2017
-
[59]
Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition
M. Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 , 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[60]
Proximal Stochastic Dual Coordinate Ascent
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[61]
Y. Shechtman, Y.C. Eldar, O. Cohen, H.N. Chapman, J. Miao, and M. Segev. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, May 2015
work page 2015
-
[62]
N.Z. Shor. Minimization methods for non-differentiable functions , volume 3. Springer Science & Business Media, 2012
work page 2012
- [63]
-
[64]
S.J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993
work page 1993
-
[65]
L. Xiao. Dual averaging methods for regularized stochastic learning and online opti- mization. Journal of Machine Learning Research , 11(Oct):2543–2596, 2010. 30
work page 2010
- [66]
-
[67]
T. Yang and Q. Lin. Rsg: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research , 19(1):236–268, 2018
work page 2018
-
[68]
Stagewise Training Accelerates Convergence of Testing Error Over SGD
T. Yang, Y. Yan, Z. Yuan, and R. Jin. Why does stagewise training accelerate conver- gence of testing error over sgd? arXiv preprint arXiv:1812.03934 , 2018. A Proofs from Section 3.2 We will need the following elementary lemma. Lemma A.1. Suppose Assumptions (A4) and (A5) hold. Fix an arbitrary γ∈ (0, 2) and consider two points y∈T γ and x∈ Rd. Then the ...
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[69]
(smoothness) The setS is a C2-smooth manifold around ¯x and the restriction f ⏐⏐ S is C2-smooth near ¯x
-
[70]
(finite identification) For any sequences ( xi,f (xi),vi) → (¯x,f (¯x), ¯v) with vi ∈ ∂Lf(xi), the points xi must all lie inS for all sufficiently large indices i. Let us first observe that under a very mild condition on the function f, identifiability at a critical points implies local sharp growth. Theorem D.2 (Identification implies sharpness) . Consider a cl...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.