Stochastic algorithms with geometric step decay converge linearly on sharp functions

Damek Davis; Dmitriy Drusvyatskiy; Vasileios Charisopoulos

arxiv: 1907.09547 · v1 · pith:SOIOTEWTnew · submitted 2019-07-22 · 🧮 math.OC · cs.LG· stat.ML

Stochastic algorithms with geometric step decay converge linearly on sharp functions

Damek Davis , Dmitriy Drusvyatskiy , Vasileios Charisopoulos This is my paper

Pith reviewed 2026-05-24 17:50 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML

keywords stochastic optimizationgeometric step decaysharp functionsnonconvex optimizationlinear convergencephase retrievalblind deconvolution

0 comments

The pith

Geometric step decay endows stochastic algorithms with local linear convergence on sharp nonconvex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that halving the step size after every few epochs allows standard stochastic methods to achieve local linear rates of convergence to global minimizers for sharp nonsmooth nonconvex functions. These functions appear in statistical recovery tasks, where only sublinear rates were previously available from polynomial decay schedules. The result covers the stochastic projected subgradient, proximal point, and prox-linear algorithms and handles the added difficulty of bounding the probability that iterates escape the local basin. Applications recover matching guarantees for phase retrieval and blind deconvolution under Gaussian models and produce new ones under heavy-tailed noise.

Core claim

For a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows the stochastic projected subgradient, proximal point, and prox-linear algorithms with a local linear rate of convergence to global minimizers.

What carries the argument

Geometric step decay schedule that halves the step size after every few epochs.

If this is right

The stochastic projected subgradient method converges linearly to global minimizers on sharp nonconvex problems.
The proximal point and prox-linear algorithms inherit the same local linear rate under geometric decay.
Phase retrieval and blind deconvolution obtain matching or improved guarantees under both Gaussian and heavy-tailed measurements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The local character of the result implies that a preliminary global phase may be needed to enter the basin before decay begins.
The same schedule could simplify step-size selection in other nonconvex statistical estimation problems that satisfy sharpness.
Heavy-tailed guarantees suggest the approach may tolerate heavier noise than typical analyses assume.

Load-bearing premise

Iterates remain inside a local neighborhood of the minimizer with high probability and do not escape.

What would settle it

A concrete sharp nonconvex instance in which the algorithm with geometric decay escapes the basin and the observed rate becomes sublinear.

Figures

Figures reproduced from arXiv: 1907.09547 by Damek Davis, Dmitriy Drusvyatskiy, Vasileios Charisopoulos.

**Figure 1.** Figure 1: Illustration of a one-sided model: f(x) = |x 2 − 1|, f0.5(y) = |1.25 − y| Lipschitz continuity We assume that the models are Lipschitz on a tube surrounding S. (A5) (Lipschitz property) Define the tube Tγ := x ∈ X | dist(x, S) ≤ γµ η ∀γ > 0. We assume that there exists a measurable function L: R d × Ω → R+ such that min v∈∂fx(x,z) kvk ≤ L(x, z) (2.5) for all x ∈ T2 and a.e. z ∈ Ω. Moreover, we assume t… view at source ↗

**Figure 2.** Figure 2: Convergence behavior for d = 100 with finite sample size m = 8·d. Phase Retrieval (left column), Blind Deconvolution (right column), pfail = 0.0 (top row), pfail = 0.2 (bottom row). Average over 10 runs. and prox-linear models agree up to an error that increases quadratically as we move from the basepoint, and the proximal subproblems force iterates to remain near the basepoint. Running the proximal point … view at source ↗

**Figure 3.** Figure 3: Convergence behavior for d = 100 with streaming data. Phase Retrieval (left column), Blind Deconvolution (right column), pfail = 0.0 (top row), pfail = 0.2 (bottom row). Average over 10 runs. is sparse and sign(hw, xii + b) = sign(yi) for most i. To find z, we minimize the function min z 1 N X N i=1 f(z;i) + τkwk1, where each component is a logistic loss: f(z;i) ≡ f(w, b;i) := log (1 + exp (−yi(hw, xii + b… view at source ↗

**Figure 4.** Figure 4: Sensitivity to step size for the phase retrieval problem with [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 5.** Figure 5: Sensitivity to step size for the blind convolution problem with [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗

**Figure 6.** Figure 6: Performance of RMBA vs. RDA on the sparse logistic regression problem. Left: [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗

read the original abstract

Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called \emph{geometric step decay} and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of \emph{sharp} convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks---phase retrieval and blind deconvolution---and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends geometric step decay to give local linear rates on sharp nonconvex stochastic problems and adds heavy-tailed guarantees, but the escape-probability control is the part that actually makes the nonconvex claim work.

read the letter

The core advance is showing that halving the step size every few epochs gives local linear convergence to global minimizers for stochastic projected subgradient, proximal point, and prox-linear methods on sharp nonconvex nonsmooth functions. This is a direct lift from the earlier convex sharp case, but the nonconvex setting adds the need to keep the iterates from leaving the basin with high probability. They handle that and then apply the result to phase retrieval and blind deconvolution, recovering the best Gaussian rates and getting new ones under heavy-tailed noise. That last part is useful because many statistical problems have heavier tails than the usual sub-Gaussian assumptions allow. The analysis appears to rest on new supermartingale or union-bound arguments that survive the geometric schedule and the sharpness constant; if those bounds hold without extra restrictions on gradients far from the minimizer, the claim is solid. The main soft spot is that everything is local, so the practical value depends on how small the total escape probability stays over the whole run. If the bound requires the noise to be small relative to the sharpness parameter or needs bounded gradients outside the basin, the result is narrower than the abstract suggests. The paper is aimed at people who already work on stochastic methods for nonconvex statistical estimation. It is worth sending to referees because the extension is clean, the applications are concrete, and the heavy-tailed part is new enough to matter.

Referee Report

2 major / 2 minor

Summary. The paper claims that geometric step decay schedules allow stochastic projected subgradient, proximal point, and prox-linear algorithms to achieve local linear convergence to global minimizers on a broad class of sharp, nonsmooth, nonconvex problems. The analysis is applied to phase retrieval and blind deconvolution, recovering optimal rates under Gaussian measurements and providing new rates under heavy-tailed noise.

Significance. If the escape-probability control holds, the result meaningfully extends the geometric-decay linear-rate phenomenon from the convex sharp case to nonconvex settings that arise in statistical recovery. The local nature of the basin and the explicit handling of stochastic escape are the novel technical elements.

major comments (2)

[Theorem 3.1 and surrounding discussion of the local basin] The central guarantee is local linear convergence inside a basin; the manuscript must therefore supply an explicit bound showing that the total probability of escape over the entire (infinite) horizon remains small (e.g., <1/2) even after repeated step-size halvings. No such quantitative escape-probability statement appears in the main theorem or its proof sketch.
[§5.2 (heavy-tailed phase retrieval)] For the heavy-tailed application the noise model is weaker than sub-Gaussian; the supermartingale or union-bound argument used to control escape must be re-verified under only moment assumptions, yet the proof appears to reuse the same concentration inequalities derived for the Gaussian case.

minor comments (2)

[§2] The definition of the sharpness constant and the radius of the local basin should be stated once in a single display equation rather than re-introduced in each theorem.
[Figure 1] Figure 1 caption does not indicate whether the plotted trajectories are single runs or averaged; this affects interpretation of the observed linear phase.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying these two technical points that require clarification. Both comments concern the rigor of the escape-probability control, which is central to extending geometric decay to the nonconvex setting. We address each below and will incorporate the requested explicit statements in the revision.

read point-by-point responses

Referee: [Theorem 3.1 and surrounding discussion of the local basin] The central guarantee is local linear convergence inside a basin; the manuscript must therefore supply an explicit bound showing that the total probability of escape over the entire (infinite) horizon remains small (e.g., <1/2) even after repeated step-size halvings. No such quantitative escape-probability statement appears in the main theorem or its proof sketch.

Authors: We agree that an explicit quantitative bound on the cumulative escape probability strengthens the result. While the proof of Theorem 3.1 already controls the per-epoch escape probability via a union bound whose terms decrease geometrically with the step-size halvings, the infinite-horizon summation is only implicit. In the revised version we will insert a new corollary (Corollary 3.2) immediately after Theorem 3.1 that sums the geometric series of per-phase escape probabilities and shows that, for any prescribed δ > 0, the initial step size and epoch lengths can be chosen so that the total escape probability over the infinite horizon is at most δ (in particular < 1/2). The argument uses only the same per-epoch tail bound already present in the proof and the geometric decay of the step sizes. revision: yes
Referee: [§5.2 (heavy-tailed phase retrieval)] For the heavy-tailed application the noise model is weaker than sub-Gaussian; the supermartingale or union-bound argument used to control escape must be re-verified under only moment assumptions, yet the proof appears to reuse the same concentration inequalities derived for the Gaussian case.

Authors: We acknowledge the need for explicit verification under moment assumptions. The escape control in §5.2 is based on a supermartingale inequality that depends solely on the second-moment bound stated in Assumption 5.3; this inequality was originally derived for the Gaussian case but holds verbatim under the weaker moment condition. Nevertheless, to remove any ambiguity we will revise the proof text to cite only the moment-based supermartingale bound, add a short remark confirming that no sub-Gaussian tail inequalities are invoked, and verify that the same union-bound argument over epochs continues to apply. revision: yes

Circularity Check

0 steps flagged

No circularity: new local analysis for nonconvex sharp case stands independently of convex prior.

full rationale

The abstract and available text present a new result extending geometric step decay to stochastic sharp nonconvex problems, explicitly noting the additional challenge of bounding escape probability from the local basin. No equations, fitted parameters, or self-citations are shown reducing the claimed linear rate to a definition or prior input by construction. The convex sharp case is cited as prior work but is not load-bearing for the nonconvex extension, which requires fresh supermartingale or union-bound arguments under noise. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the sharpness condition for nonconvex functions and standard stochastic oracle assumptions; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Objective functions satisfy a sharpness condition that guarantees linear growth away from minimizers.
This condition is invoked to obtain linear rates and is the key domain assumption distinguishing the setting from general nonconvex problems.
domain assumption Stochastic oracles satisfy bounded variance or moment conditions sufficient for concentration.
Required to control the probability of escape from the local basin.

pith-pipeline@v0.9.0 · 5779 in / 1207 out tokens · 71230 ms · 2026-05-24T17:50:01.827579+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Convergence of difference inclusions via a diameter criterion
math.OC 2026-05 unverdicted novelty 7.0

A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1]

Abadi, A

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorﬂow: Large-scale machine learning on het- erogeneous systems, 2015. Software available from tensorﬂow. org , 1(2), 2015

work page 2015
[2]

Ahmed, B

A. Ahmed, B. Recht, and J. Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory , 60(3):1711–1732, 2014

work page 2014
[3]

Asi and J.C

H. Asi and J.C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. arXiv preprint arXiv:1810.05633 , 2018

work page arXiv 2018
[4]

Asi and J.C

H. Asi and J.C. Duchi. The importance of better models in stochastic optimization. arXiv preprint arXiv:1903.08619 , 2019

work page arXiv 1903
[5]

Aybat, A

N.S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022 , 2019

work page arXiv 1901
[6]

Cand` es, T

E.J. Cand` es, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recov- ery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013. 26

work page 2013
[7]

Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence

V. Charisopoulos, Y. Chen, D. Davis, M. D´ ıaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[8]

Composite optimization for robust blind deconvolution

V. Charisopoulos, D. Davis, M. D´ ıaz, and D. Drusvyatskiy. Composite optimization for robust blind deconvolution. arXiv:1901.01624, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[9]

Geometric step decay: reference implementation

COR-OPT. Geometric step decay: reference implementation. https://github.com/ COR-OPT/GeomStepDecay, 2019

work page 2019
[10]

D. Davis. SMART: The stochastic monotone aggregated root-ﬁnding algorithm. arXiv preprint arXiv:1601.00698, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

Davis and D

D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization , 29(1):207–239, 2019

work page 2019
[12]

Davis, D

D. Davis, D. Drusvyatskiy, K.J. MacPhee, and C. Paquette. Subgradient methods for sharp weakly convex functions. J. Optim. Theory Appl. , 179(3):962–982, 2018

work page 2018
[13]

The nonsmooth landscape of phase retrieval

D. Davis, D. Drusvyatskiy, and C. Paquette. The nonsmooth landscape of phase re- trieval. To appear in IMA J. Numer. Anal., arXiv:1711.03247 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27 , pages 1646–1654. Curran Associates, Inc., 2014

work page 2014
[15]

Defazio, J

A. Defazio, J. Domke, and T.S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning , volume 32 of Proceedings of Machine Learning Research, pages 1125–1133, Bejing, China, 22–24 Jun

work page
[16]

Drusvyatskiy

D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News , 26(1), 2018

work page 2018
[17]

Optimality, identifiability, and sensitivity

D. Drusvyatskiy and A. S. Lewis. Optimality, identiﬁablity, and sensitivity. arXiv preprint arXiv:1207.6628, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[18]

Duchi and F

J.C. Duchi and F. Ruan. Solving (most) of a set of quadratic equalities: com- posite optimization for robust phase retrieval. IMA J. Information and Inference, doi:10.1093/imaiai/iay015, 2018

work page doi:10.1093/imaiai/iay015 2018
[19]

Duchi and F

J.C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex opti- mization problems. SIAM J. Optim. , 28(4):3229–3259, 2018

work page 2018
[20]

Eldar and S

Y.C. Eldar and S. Mendelson. Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal. , 36(3):473–494, 2014. 27

work page 2014
[21]

I.I. Eremin. The relaxation method of solving systems of inequalities with convex func- tions on the left-hand side. Dokl. Akad. Nauk SSSR , 160:994–996, 1965

work page 1965
[22]

Restarting accelerated gradient methods with a rough strong convexity estimate

O. Fercoq and Z. Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[23]

Fercoq and Z

O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. arXiv preprint arXiv:1709.02300 , 2017

work page arXiv 2017
[24]

Freund and Haihao Lu

Robert M. Freund and Haihao Lu. New computational guarantees for solving convex op- timization problems with ﬁrst order methods, via a function growth condition measure. Mathematical Programming, 170(2):445–477, Aug 2018

work page 2018
[25]

R. Ge, S.M. Kakade, R. Kidambi, and P. Netrapalli. The step decay schedule: A near op- timal, geometrically decaying learning rate procedure. arXiv preprint arXiv:1904.12838, 2019

work page arXiv 1904
[26]

Ghadimi and G

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly con- vex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization , 23(4):2061–2089, 2013

work page 2061
[27]

J.L. Goﬃn. On convergence rates of subgradient optimization methods. Math. Pro- gramming, 13(3):329–347, 1977

work page 1977
[28]

Goldstein and C

T. Goldstein and C. Studer. Phasemax: Convex phase retrieval via basis pursuit. IEEE Transactions on Information Theory, 64(4):2675–2689, April 2018

work page 2018
[29]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

work page 2016
[30]

Hsu and S

D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research , 17(1):543–582, 2016

work page 2016
[31]

Johnson and T

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural In- formation Processing Systems, NIPS’13, pages 315–323, USA, 2013. Curran Associates Inc

work page 2013
[32]

Johnstone and P

P.R. Johnstone and P. Moulin. Faster subgradient methods for functions with h¨ olderian growth. Mathematical Programming, Jan 2019

work page 2019
[33]

Krizhevsky, I

A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classiﬁcation with deep convo- lutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012

work page 2012
[34]

Kulunchakov and J

A. Kulunchakov and J. Mairal. A generic acceleration framework for stochastic com- posite optimization. arXiv preprint arXiv:1906.01164 , 2019. 28

work page arXiv 1906
[35]

Kushner and G.G

H.J. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York) . Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability

work page 2003
[36]

Lecun, L

Y. Lecun, L. Bottou, Y. Bengio, and P. Haﬀner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998

work page 1998
[37]

Lee and S.J

S. Lee and S.J. Wright. Manifold identiﬁcation in dual averaging for regularized stochas- tic online learning. Journal of Machine Learning Research , 13(Jun):1705–1744, 2012

work page 2012
[38]

A.S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (electronic) (2003), 2002

work page 2003
[39]

X. Li, S. Ling, T. Strohmer, and K. Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis , 2018

work page 2018
[40]

Y. Li, C. Ma, Y. Chen, and Y. Chi. Nonconvex matrix factorization from rank-one measurements. arXiv:1802.06286, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[41]

J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. , 25(2):829–855, 2015

work page 2015
[42]

B. S. Mordukhovich. Variational Analysis and Generalized Diﬀerentiation I: Basic Theory. Grundlehren der mathematischen Wissenschaften, Vol 330, Springer, Berlin, 2006

work page 2006
[43]

Nedi´ c and D

A. Nedi´ c and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications , pages 223–264. Springer, 2001

work page 2001
[44]

Nemirovskii and Yu.E

A.S. Nemirovskii and Yu.E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics , 25(2):21 – 30, 1985

work page 1985
[45]

Nemirovsky and D.B

A.S. Nemirovsky and D.B. Yudin. Problem complexity and method eﬃciency in opti- mization. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York, 1983. Translated from the Russian and with a preface by E. R. Dawson, Wiley-Interscience Series in Discrete Mathematics

work page 1983
[46]

Nesterov

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009

work page 2009
[47]

Nesterov

Yu. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2). Dokl. Akad. Nauk SSSR , 269(3):543–547, 1983

work page 1983
[48]

O’Donoghue and E

B. O’Donoghue and E. Cand` es. Adaptive restart for accelerated gradient schemes. Found. Comput. Math., 15(3):715–732, 2015

work page 2015
[49]

Paszke, S

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic diﬀerentiation in pytorch. In NIPS-W, 2017. 29

work page 2017
[50]

J.-P. Penot. Calculus without derivatives, volume 266 of Graduate Texts in Mathematics. Springer, New York, 2013

work page 2013
[51]

Poliquin and R.T

R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348:1805–1838, 1996

work page 1996
[52]

B.T. Poljak. Minimization of nonsmooth functionals. ˇZ. Vyˇ cisl. Mat. i Mat. Fiz. , 9:509–521, 1969

work page 1969
[53]

Polyak and A.B

B.T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. , 30(4):838–855, 1992

work page 1992
[54]

Renegar and B

J. Renegar and B. Grimmer. A simple nearly-optimal restart scheme for speeding-up ﬁrst order methods. arXiv preprint arXiv:1803.00151 , 2018

work page arXiv 2018
[55]

Robbins and S

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics , 22:400–407, 1951

work page 1951
[56]

Rockafellar and R.J-B

R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematis- chen Wissenschaften, Vol 317, Springer, Berlin, 1998

work page 1998
[57]

Roulet and A

V. Roulet and A. d’Aspremont. Sharpness, restart and acceleration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems 30 , pages 1119–1129. Curran Associates, Inc., 2017

work page 2017
[58]

Schmidt, N

M. Schmidt, N. Le Roux, and F. Bach. Minimizing ﬁnite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017

work page 2017
[59]

Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

M. Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 , 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[60]

Proximal Stochastic Dual Coordinate Ascent

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[61]

Shechtman, Y.C

Y. Shechtman, Y.C. Eldar, O. Cohen, H.N. Chapman, J. Miao, and M. Segev. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, May 2015

work page 2015
[62]

N.Z. Shor. Minimization methods for non-diﬀerentiable functions , volume 3. Springer Science & Business Media, 2012

work page 2012
[63]

Tan and R

Y.S. Tan and R. Vershynin. Phase retrieval via randomized kaczmarz: Theoretical guarantees. Information and Inference: A Journal of the IMA , 8(1):97–123, 2018

work page 2018
[64]

S.J. Wright. Identiﬁable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993

work page 1993
[65]

L. Xiao. Dual averaging methods for regularized stochastic learning and online opti- mization. Journal of Machine Learning Research , 11(Oct):2543–2596, 2010. 30

work page 2010
[66]

Y. Xu, Q. Lin, and T. Yang. Accelerated stochastic subgradient methods under local error bound condition. arXiv preprint arXiv:1607.01027 , 2016

work page arXiv 2016
[67]

Yang and Q

T. Yang and Q. Lin. Rsg: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research , 19(1):236–268, 2018

work page 2018
[68]

Stagewise Training Accelerates Convergence of Testing Error Over SGD

T. Yang, Y. Yan, Z. Yuan, and R. Jin. Why does stagewise training accelerate conver- gence of testing error over sgd? arXiv preprint arXiv:1812.03934 , 2018. A Proofs from Section 3.2 We will need the following elementary lemma. Lemma A.1. Suppose Assumptions (A4) and (A5) hold. Fix an arbitrary γ∈ (0, 2) and consider two points y∈T γ and x∈ Rd. Then the ...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[69]

(smoothness) The setS is a C2-smooth manifold around ¯x and the restriction f ⏐⏐ S is C2-smooth near ¯x

work page
[70]

Let us ﬁrst observe that under a very mild condition on the function f, identiﬁability at a critical points implies local sharp growth

(ﬁnite identiﬁcation) For any sequences ( xi,f (xi),vi) → (¯x,f (¯x), ¯v) with vi ∈ ∂Lf(xi), the points xi must all lie inS for all suﬃciently large indices i. Let us ﬁrst observe that under a very mild condition on the function f, identiﬁability at a critical points implies local sharp growth. Theorem D.2 (Identiﬁcation implies sharpness) . Consider a cl...

work page

[1] [1]

Abadi, A

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorﬂow: Large-scale machine learning on het- erogeneous systems, 2015. Software available from tensorﬂow. org , 1(2), 2015

work page 2015

[2] [2]

Ahmed, B

A. Ahmed, B. Recht, and J. Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory , 60(3):1711–1732, 2014

work page 2014

[3] [3]

Asi and J.C

H. Asi and J.C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. arXiv preprint arXiv:1810.05633 , 2018

work page arXiv 2018

[4] [4]

Asi and J.C

H. Asi and J.C. Duchi. The importance of better models in stochastic optimization. arXiv preprint arXiv:1903.08619 , 2019

work page arXiv 1903

[5] [5]

Aybat, A

N.S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022 , 2019

work page arXiv 1901

[6] [6]

Cand` es, T

E.J. Cand` es, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recov- ery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013. 26

work page 2013

[7] [7]

Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence

V. Charisopoulos, Y. Chen, D. Davis, M. D´ ıaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[8] [8]

Composite optimization for robust blind deconvolution

V. Charisopoulos, D. Davis, M. D´ ıaz, and D. Drusvyatskiy. Composite optimization for robust blind deconvolution. arXiv:1901.01624, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[9] [9]

Geometric step decay: reference implementation

COR-OPT. Geometric step decay: reference implementation. https://github.com/ COR-OPT/GeomStepDecay, 2019

work page 2019

[10] [10]

D. Davis. SMART: The stochastic monotone aggregated root-ﬁnding algorithm. arXiv preprint arXiv:1601.00698, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

Davis and D

D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization , 29(1):207–239, 2019

work page 2019

[12] [12]

Davis, D

D. Davis, D. Drusvyatskiy, K.J. MacPhee, and C. Paquette. Subgradient methods for sharp weakly convex functions. J. Optim. Theory Appl. , 179(3):962–982, 2018

work page 2018

[13] [13]

The nonsmooth landscape of phase retrieval

D. Davis, D. Drusvyatskiy, and C. Paquette. The nonsmooth landscape of phase re- trieval. To appear in IMA J. Numer. Anal., arXiv:1711.03247 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[14] [14]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27 , pages 1646–1654. Curran Associates, Inc., 2014

work page 2014

[15] [15]

Defazio, J

A. Defazio, J. Domke, and T.S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning , volume 32 of Proceedings of Machine Learning Research, pages 1125–1133, Bejing, China, 22–24 Jun

work page

[16] [16]

Drusvyatskiy

D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News , 26(1), 2018

work page 2018

[17] [17]

Optimality, identifiability, and sensitivity

D. Drusvyatskiy and A. S. Lewis. Optimality, identiﬁablity, and sensitivity. arXiv preprint arXiv:1207.6628, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[18] [18]

Duchi and F

J.C. Duchi and F. Ruan. Solving (most) of a set of quadratic equalities: com- posite optimization for robust phase retrieval. IMA J. Information and Inference, doi:10.1093/imaiai/iay015, 2018

work page doi:10.1093/imaiai/iay015 2018

[19] [19]

Duchi and F

J.C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex opti- mization problems. SIAM J. Optim. , 28(4):3229–3259, 2018

work page 2018

[20] [20]

Eldar and S

Y.C. Eldar and S. Mendelson. Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal. , 36(3):473–494, 2014. 27

work page 2014

[21] [21]

I.I. Eremin. The relaxation method of solving systems of inequalities with convex func- tions on the left-hand side. Dokl. Akad. Nauk SSSR , 160:994–996, 1965

work page 1965

[22] [22]

Restarting accelerated gradient methods with a rough strong convexity estimate

O. Fercoq and Z. Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[23] [23]

Fercoq and Z

O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. arXiv preprint arXiv:1709.02300 , 2017

work page arXiv 2017

[24] [24]

Freund and Haihao Lu

Robert M. Freund and Haihao Lu. New computational guarantees for solving convex op- timization problems with ﬁrst order methods, via a function growth condition measure. Mathematical Programming, 170(2):445–477, Aug 2018

work page 2018

[25] [25]

R. Ge, S.M. Kakade, R. Kidambi, and P. Netrapalli. The step decay schedule: A near op- timal, geometrically decaying learning rate procedure. arXiv preprint arXiv:1904.12838, 2019

work page arXiv 1904

[26] [26]

Ghadimi and G

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly con- vex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization , 23(4):2061–2089, 2013

work page 2061

[27] [27]

J.L. Goﬃn. On convergence rates of subgradient optimization methods. Math. Pro- gramming, 13(3):329–347, 1977

work page 1977

[28] [28]

Goldstein and C

T. Goldstein and C. Studer. Phasemax: Convex phase retrieval via basis pursuit. IEEE Transactions on Information Theory, 64(4):2675–2689, April 2018

work page 2018

[29] [29]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

work page 2016

[30] [30]

Hsu and S

D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research , 17(1):543–582, 2016

work page 2016

[31] [31]

Johnson and T

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural In- formation Processing Systems, NIPS’13, pages 315–323, USA, 2013. Curran Associates Inc

work page 2013

[32] [32]

Johnstone and P

P.R. Johnstone and P. Moulin. Faster subgradient methods for functions with h¨ olderian growth. Mathematical Programming, Jan 2019

work page 2019

[33] [33]

Krizhevsky, I

A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classiﬁcation with deep convo- lutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012

work page 2012

[34] [34]

Kulunchakov and J

A. Kulunchakov and J. Mairal. A generic acceleration framework for stochastic com- posite optimization. arXiv preprint arXiv:1906.01164 , 2019. 28

work page arXiv 1906

[35] [35]

Kushner and G.G

H.J. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York) . Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability

work page 2003

[36] [36]

Lecun, L

Y. Lecun, L. Bottou, Y. Bengio, and P. Haﬀner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998

work page 1998

[37] [37]

Lee and S.J

S. Lee and S.J. Wright. Manifold identiﬁcation in dual averaging for regularized stochas- tic online learning. Journal of Machine Learning Research , 13(Jun):1705–1744, 2012

work page 2012

[38] [38]

A.S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (electronic) (2003), 2002

work page 2003

[39] [39]

X. Li, S. Ling, T. Strohmer, and K. Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis , 2018

work page 2018

[40] [40]

Y. Li, C. Ma, Y. Chen, and Y. Chi. Nonconvex matrix factorization from rank-one measurements. arXiv:1802.06286, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[41] [41]

J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. , 25(2):829–855, 2015

work page 2015

[42] [42]

B. S. Mordukhovich. Variational Analysis and Generalized Diﬀerentiation I: Basic Theory. Grundlehren der mathematischen Wissenschaften, Vol 330, Springer, Berlin, 2006

work page 2006

[43] [43]

Nedi´ c and D

A. Nedi´ c and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications , pages 223–264. Springer, 2001

work page 2001

[44] [44]

Nemirovskii and Yu.E

A.S. Nemirovskii and Yu.E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics , 25(2):21 – 30, 1985

work page 1985

[45] [45]

Nemirovsky and D.B

A.S. Nemirovsky and D.B. Yudin. Problem complexity and method eﬃciency in opti- mization. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York, 1983. Translated from the Russian and with a preface by E. R. Dawson, Wiley-Interscience Series in Discrete Mathematics

work page 1983

[46] [46]

Nesterov

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009

work page 2009

[47] [47]

Nesterov

Yu. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2). Dokl. Akad. Nauk SSSR , 269(3):543–547, 1983

work page 1983

[48] [48]

O’Donoghue and E

B. O’Donoghue and E. Cand` es. Adaptive restart for accelerated gradient schemes. Found. Comput. Math., 15(3):715–732, 2015

work page 2015

[49] [49]

Paszke, S

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic diﬀerentiation in pytorch. In NIPS-W, 2017. 29

work page 2017

[50] [50]

J.-P. Penot. Calculus without derivatives, volume 266 of Graduate Texts in Mathematics. Springer, New York, 2013

work page 2013

[51] [51]

Poliquin and R.T

R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348:1805–1838, 1996

work page 1996

[52] [52]

B.T. Poljak. Minimization of nonsmooth functionals. ˇZ. Vyˇ cisl. Mat. i Mat. Fiz. , 9:509–521, 1969

work page 1969

[53] [53]

Polyak and A.B

B.T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. , 30(4):838–855, 1992

work page 1992

[54] [54]

Renegar and B

J. Renegar and B. Grimmer. A simple nearly-optimal restart scheme for speeding-up ﬁrst order methods. arXiv preprint arXiv:1803.00151 , 2018

work page arXiv 2018

[55] [55]

Robbins and S

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics , 22:400–407, 1951

work page 1951

[56] [56]

Rockafellar and R.J-B

R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematis- chen Wissenschaften, Vol 317, Springer, Berlin, 1998

work page 1998

[57] [57]

Roulet and A

V. Roulet and A. d’Aspremont. Sharpness, restart and acceleration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems 30 , pages 1119–1129. Curran Associates, Inc., 2017

work page 2017

[58] [58]

Schmidt, N

M. Schmidt, N. Le Roux, and F. Bach. Minimizing ﬁnite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017

work page 2017

[59] [59]

Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

M. Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 , 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[60] [60]

Proximal Stochastic Dual Coordinate Ascent

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[61] [61]

Shechtman, Y.C

Y. Shechtman, Y.C. Eldar, O. Cohen, H.N. Chapman, J. Miao, and M. Segev. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, May 2015

work page 2015

[62] [62]

N.Z. Shor. Minimization methods for non-diﬀerentiable functions , volume 3. Springer Science & Business Media, 2012

work page 2012

[63] [63]

Tan and R

Y.S. Tan and R. Vershynin. Phase retrieval via randomized kaczmarz: Theoretical guarantees. Information and Inference: A Journal of the IMA , 8(1):97–123, 2018

work page 2018

[64] [64]

S.J. Wright. Identiﬁable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993

work page 1993

[65] [65]

L. Xiao. Dual averaging methods for regularized stochastic learning and online opti- mization. Journal of Machine Learning Research , 11(Oct):2543–2596, 2010. 30

work page 2010

[66] [66]

Y. Xu, Q. Lin, and T. Yang. Accelerated stochastic subgradient methods under local error bound condition. arXiv preprint arXiv:1607.01027 , 2016

work page arXiv 2016

[67] [67]

Yang and Q

T. Yang and Q. Lin. Rsg: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research , 19(1):236–268, 2018

work page 2018

[68] [68]

Stagewise Training Accelerates Convergence of Testing Error Over SGD

T. Yang, Y. Yan, Z. Yuan, and R. Jin. Why does stagewise training accelerate conver- gence of testing error over sgd? arXiv preprint arXiv:1812.03934 , 2018. A Proofs from Section 3.2 We will need the following elementary lemma. Lemma A.1. Suppose Assumptions (A4) and (A5) hold. Fix an arbitrary γ∈ (0, 2) and consider two points y∈T γ and x∈ Rd. Then the ...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[69] [69]

(smoothness) The setS is a C2-smooth manifold around ¯x and the restriction f ⏐⏐ S is C2-smooth near ¯x

work page

[70] [70]

Let us ﬁrst observe that under a very mild condition on the function f, identiﬁability at a critical points implies local sharp growth

(ﬁnite identiﬁcation) For any sequences ( xi,f (xi),vi) → (¯x,f (¯x), ¯v) with vi ∈ ∂Lf(xi), the points xi must all lie inS for all suﬃciently large indices i. Let us ﬁrst observe that under a very mild condition on the function f, identiﬁability at a critical points implies local sharp growth. Theorem D.2 (Identiﬁcation implies sharpness) . Consider a cl...

work page