pith. sign in

arxiv: 1907.09547 · v1 · pith:SOIOTEWTnew · submitted 2019-07-22 · 🧮 math.OC · cs.LG· stat.ML

Stochastic algorithms with geometric step decay converge linearly on sharp functions

Pith reviewed 2026-05-24 17:50 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords stochastic optimizationgeometric step decaysharp functionsnonconvex optimizationlinear convergencephase retrievalblind deconvolution
0
0 comments X

The pith

Geometric step decay endows stochastic algorithms with local linear convergence on sharp nonconvex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that halving the step size after every few epochs allows standard stochastic methods to achieve local linear rates of convergence to global minimizers for sharp nonsmooth nonconvex functions. These functions appear in statistical recovery tasks, where only sublinear rates were previously available from polynomial decay schedules. The result covers the stochastic projected subgradient, proximal point, and prox-linear algorithms and handles the added difficulty of bounding the probability that iterates escape the local basin. Applications recover matching guarantees for phase retrieval and blind deconvolution under Gaussian models and produce new ones under heavy-tailed noise.

Core claim

For a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows the stochastic projected subgradient, proximal point, and prox-linear algorithms with a local linear rate of convergence to global minimizers.

What carries the argument

Geometric step decay schedule that halves the step size after every few epochs.

If this is right

  • The stochastic projected subgradient method converges linearly to global minimizers on sharp nonconvex problems.
  • The proximal point and prox-linear algorithms inherit the same local linear rate under geometric decay.
  • Phase retrieval and blind deconvolution obtain matching or improved guarantees under both Gaussian and heavy-tailed measurements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The local character of the result implies that a preliminary global phase may be needed to enter the basin before decay begins.
  • The same schedule could simplify step-size selection in other nonconvex statistical estimation problems that satisfy sharpness.
  • Heavy-tailed guarantees suggest the approach may tolerate heavier noise than typical analyses assume.

Load-bearing premise

Iterates remain inside a local neighborhood of the minimizer with high probability and do not escape.

What would settle it

A concrete sharp nonconvex instance in which the algorithm with geometric decay escapes the basin and the observed rate becomes sublinear.

Figures

Figures reproduced from arXiv: 1907.09547 by Damek Davis, Dmitriy Drusvyatskiy, Vasileios Charisopoulos.

Figure 1
Figure 1. Figure 1: Illustration of a one-sided model: f(x) = |x 2 − 1|, f0.5(y) = |1.25 − y| Lipschitz continuity We assume that the models are Lipschitz on a tube surrounding S. (A5) (Lipschitz property) Define the tube Tγ :=  x ∈ X | dist(x, S) ≤ γµ η  ∀γ > 0. We assume that there exists a measurable function L: R d × Ω → R+ such that min v∈∂fx(x,z) kvk ≤ L(x, z) (2.5) for all x ∈ T2 and a.e. z ∈ Ω. Moreover, we assume t… view at source ↗
Figure 2
Figure 2. Figure 2: Convergence behavior for d = 100 with finite sample size m = 8·d. Phase Retrieval (left column), Blind Deconvolution (right column), pfail = 0.0 (top row), pfail = 0.2 (bottom row). Average over 10 runs. and prox-linear models agree up to an error that increases quadratically as we move from the basepoint, and the proximal subproblems force iterates to remain near the basepoint. Running the proximal point … view at source ↗
Figure 3
Figure 3. Figure 3: Convergence behavior for d = 100 with streaming data. Phase Retrieval (left column), Blind Deconvolution (right column), pfail = 0.0 (top row), pfail = 0.2 (bottom row). Average over 10 runs. is sparse and sign(hw, xii + b) = sign(yi) for most i. To find z, we minimize the function min z 1 N X N i=1 f(z;i) + τkwk1, where each component is a logistic loss: f(z;i) ≡ f(w, b;i) := log (1 + exp (−yi(hw, xii + b… view at source ↗
Figure 4
Figure 4. Figure 4: Sensitivity to step size for the phase retrieval problem with [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Sensitivity to step size for the blind convolution problem with [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance of RMBA vs. RDA on the sparse logistic regression problem. Left: [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
read the original abstract

Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called \emph{geometric step decay} and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of \emph{sharp} convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks---phase retrieval and blind deconvolution---and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that geometric step decay schedules allow stochastic projected subgradient, proximal point, and prox-linear algorithms to achieve local linear convergence to global minimizers on a broad class of sharp, nonsmooth, nonconvex problems. The analysis is applied to phase retrieval and blind deconvolution, recovering optimal rates under Gaussian measurements and providing new rates under heavy-tailed noise.

Significance. If the escape-probability control holds, the result meaningfully extends the geometric-decay linear-rate phenomenon from the convex sharp case to nonconvex settings that arise in statistical recovery. The local nature of the basin and the explicit handling of stochastic escape are the novel technical elements.

major comments (2)
  1. [Theorem 3.1 and surrounding discussion of the local basin] The central guarantee is local linear convergence inside a basin; the manuscript must therefore supply an explicit bound showing that the total probability of escape over the entire (infinite) horizon remains small (e.g., <1/2) even after repeated step-size halvings. No such quantitative escape-probability statement appears in the main theorem or its proof sketch.
  2. [§5.2 (heavy-tailed phase retrieval)] For the heavy-tailed application the noise model is weaker than sub-Gaussian; the supermartingale or union-bound argument used to control escape must be re-verified under only moment assumptions, yet the proof appears to reuse the same concentration inequalities derived for the Gaussian case.
minor comments (2)
  1. [§2] The definition of the sharpness constant and the radius of the local basin should be stated once in a single display equation rather than re-introduced in each theorem.
  2. [Figure 1] Figure 1 caption does not indicate whether the plotted trajectories are single runs or averaged; this affects interpretation of the observed linear phase.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying these two technical points that require clarification. Both comments concern the rigor of the escape-probability control, which is central to extending geometric decay to the nonconvex setting. We address each below and will incorporate the requested explicit statements in the revision.

read point-by-point responses
  1. Referee: [Theorem 3.1 and surrounding discussion of the local basin] The central guarantee is local linear convergence inside a basin; the manuscript must therefore supply an explicit bound showing that the total probability of escape over the entire (infinite) horizon remains small (e.g., <1/2) even after repeated step-size halvings. No such quantitative escape-probability statement appears in the main theorem or its proof sketch.

    Authors: We agree that an explicit quantitative bound on the cumulative escape probability strengthens the result. While the proof of Theorem 3.1 already controls the per-epoch escape probability via a union bound whose terms decrease geometrically with the step-size halvings, the infinite-horizon summation is only implicit. In the revised version we will insert a new corollary (Corollary 3.2) immediately after Theorem 3.1 that sums the geometric series of per-phase escape probabilities and shows that, for any prescribed δ > 0, the initial step size and epoch lengths can be chosen so that the total escape probability over the infinite horizon is at most δ (in particular < 1/2). The argument uses only the same per-epoch tail bound already present in the proof and the geometric decay of the step sizes. revision: yes

  2. Referee: [§5.2 (heavy-tailed phase retrieval)] For the heavy-tailed application the noise model is weaker than sub-Gaussian; the supermartingale or union-bound argument used to control escape must be re-verified under only moment assumptions, yet the proof appears to reuse the same concentration inequalities derived for the Gaussian case.

    Authors: We acknowledge the need for explicit verification under moment assumptions. The escape control in §5.2 is based on a supermartingale inequality that depends solely on the second-moment bound stated in Assumption 5.3; this inequality was originally derived for the Gaussian case but holds verbatim under the weaker moment condition. Nevertheless, to remove any ambiguity we will revise the proof text to cite only the moment-based supermartingale bound, add a short remark confirming that no sub-Gaussian tail inequalities are invoked, and verify that the same union-bound argument over epochs continues to apply. revision: yes

Circularity Check

0 steps flagged

No circularity: new local analysis for nonconvex sharp case stands independently of convex prior.

full rationale

The abstract and available text present a new result extending geometric step decay to stochastic sharp nonconvex problems, explicitly noting the additional challenge of bounding escape probability from the local basin. No equations, fitted parameters, or self-citations are shown reducing the claimed linear rate to a definition or prior input by construction. The convex sharp case is cited as prior work but is not load-bearing for the nonconvex extension, which requires fresh supermartingale or union-bound arguments under noise. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the sharpness condition for nonconvex functions and standard stochastic oracle assumptions; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Objective functions satisfy a sharpness condition that guarantees linear growth away from minimizers.
    This condition is invoked to obtain linear rates and is the key domain assumption distinguishing the setting from general nonconvex problems.
  • domain assumption Stochastic oracles satisfy bounded variance or moment conditions sufficient for concentration.
    Required to control the probability of escape from the local basin.

pith-pipeline@v0.9.0 · 5779 in / 1207 out tokens · 71230 ms · 2026-05-24T17:50:01.827579+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Convergence of difference inclusions via a diameter criterion

    math.OC 2026-05 unverdicted novelty 7.0

    A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Abadi, A

    M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on het- erogeneous systems, 2015. Software available from tensorflow. org , 1(2), 2015

  2. [2]

    Ahmed, B

    A. Ahmed, B. Recht, and J. Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory , 60(3):1711–1732, 2014

  3. [3]

    Asi and J.C

    H. Asi and J.C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. arXiv preprint arXiv:1810.05633 , 2018

  4. [4]

    Asi and J.C

    H. Asi and J.C. Duchi. The importance of better models in stochastic optimization. arXiv preprint arXiv:1903.08619 , 2019

  5. [5]

    Aybat, A

    N.S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022 , 2019

  6. [6]

    Cand` es, T

    E.J. Cand` es, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recov- ery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013. 26

  7. [7]

    Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence

    V. Charisopoulos, Y. Chen, D. Davis, M. D´ ıaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020 , 2019

  8. [8]

    Composite optimization for robust blind deconvolution

    V. Charisopoulos, D. Davis, M. D´ ıaz, and D. Drusvyatskiy. Composite optimization for robust blind deconvolution. arXiv:1901.01624, 2019

  9. [9]

    Geometric step decay: reference implementation

    COR-OPT. Geometric step decay: reference implementation. https://github.com/ COR-OPT/GeomStepDecay, 2019

  10. [10]

    D. Davis. SMART: The stochastic monotone aggregated root-finding algorithm. arXiv preprint arXiv:1601.00698, 2016

  11. [11]

    Davis and D

    D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization , 29(1):207–239, 2019

  12. [12]

    Davis, D

    D. Davis, D. Drusvyatskiy, K.J. MacPhee, and C. Paquette. Subgradient methods for sharp weakly convex functions. J. Optim. Theory Appl. , 179(3):962–982, 2018

  13. [13]

    The nonsmooth landscape of phase retrieval

    D. Davis, D. Drusvyatskiy, and C. Paquette. The nonsmooth landscape of phase re- trieval. To appear in IMA J. Numer. Anal., arXiv:1711.03247 , 2017

  14. [14]

    Defazio, F

    A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27 , pages 1646–1654. Curran Associates, Inc., 2014

  15. [15]

    Defazio, J

    A. Defazio, J. Domke, and T.S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning , volume 32 of Proceedings of Machine Learning Research, pages 1125–1133, Bejing, China, 22–24 Jun

  16. [16]

    Drusvyatskiy

    D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News , 26(1), 2018

  17. [17]

    Optimality, identifiability, and sensitivity

    D. Drusvyatskiy and A. S. Lewis. Optimality, identifiablity, and sensitivity. arXiv preprint arXiv:1207.6628, 2012

  18. [18]

    Duchi and F

    J.C. Duchi and F. Ruan. Solving (most) of a set of quadratic equalities: com- posite optimization for robust phase retrieval. IMA J. Information and Inference, doi:10.1093/imaiai/iay015, 2018

  19. [19]

    Duchi and F

    J.C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex opti- mization problems. SIAM J. Optim. , 28(4):3229–3259, 2018

  20. [20]

    Eldar and S

    Y.C. Eldar and S. Mendelson. Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal. , 36(3):473–494, 2014. 27

  21. [21]

    I.I. Eremin. The relaxation method of solving systems of inequalities with convex func- tions on the left-hand side. Dokl. Akad. Nauk SSSR , 160:994–996, 1965

  22. [22]

    Restarting accelerated gradient methods with a rough strong convexity estimate

    O. Fercoq and Z. Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358 , 2016

  23. [23]

    Fercoq and Z

    O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. arXiv preprint arXiv:1709.02300 , 2017

  24. [24]

    Freund and Haihao Lu

    Robert M. Freund and Haihao Lu. New computational guarantees for solving convex op- timization problems with first order methods, via a function growth condition measure. Mathematical Programming, 170(2):445–477, Aug 2018

  25. [25]

    R. Ge, S.M. Kakade, R. Kidambi, and P. Netrapalli. The step decay schedule: A near op- timal, geometrically decaying learning rate procedure. arXiv preprint arXiv:1904.12838, 2019

  26. [26]

    Ghadimi and G

    S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly con- vex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization , 23(4):2061–2089, 2013

  27. [27]

    J.L. Goffin. On convergence rates of subgradient optimization methods. Math. Pro- gramming, 13(3):329–347, 1977

  28. [28]

    Goldstein and C

    T. Goldstein and C. Studer. Phasemax: Convex phase retrieval via basis pursuit. IEEE Transactions on Information Theory, 64(4):2675–2689, April 2018

  29. [29]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

  30. [30]

    Hsu and S

    D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research , 17(1):543–582, 2016

  31. [31]

    Johnson and T

    R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural In- formation Processing Systems, NIPS’13, pages 315–323, USA, 2013. Curran Associates Inc

  32. [32]

    Johnstone and P

    P.R. Johnstone and P. Moulin. Faster subgradient methods for functions with h¨ olderian growth. Mathematical Programming, Jan 2019

  33. [33]

    Krizhevsky, I

    A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convo- lutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012

  34. [34]

    Kulunchakov and J

    A. Kulunchakov and J. Mairal. A generic acceleration framework for stochastic com- posite optimization. arXiv preprint arXiv:1906.01164 , 2019. 28

  35. [35]

    Kushner and G.G

    H.J. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York) . Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability

  36. [36]

    Lecun, L

    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998

  37. [37]

    Lee and S.J

    S. Lee and S.J. Wright. Manifold identification in dual averaging for regularized stochas- tic online learning. Journal of Machine Learning Research , 13(Jun):1705–1744, 2012

  38. [38]

    A.S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (electronic) (2003), 2002

  39. [39]

    X. Li, S. Ling, T. Strohmer, and K. Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis , 2018

  40. [40]

    Y. Li, C. Ma, Y. Chen, and Y. Chi. Nonconvex matrix factorization from rank-one measurements. arXiv:1802.06286, 2018

  41. [41]

    J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. , 25(2):829–855, 2015

  42. [42]

    B. S. Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory. Grundlehren der mathematischen Wissenschaften, Vol 330, Springer, Berlin, 2006

  43. [43]

    Nedi´ c and D

    A. Nedi´ c and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications , pages 223–264. Springer, 2001

  44. [44]

    Nemirovskii and Yu.E

    A.S. Nemirovskii and Yu.E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics , 25(2):21 – 30, 1985

  45. [45]

    Nemirovsky and D.B

    A.S. Nemirovsky and D.B. Yudin. Problem complexity and method efficiency in opti- mization. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York, 1983. Translated from the Russian and with a preface by E. R. Dawson, Wiley-Interscience Series in Discrete Mathematics

  46. [46]

    Nesterov

    Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009

  47. [47]

    Nesterov

    Yu. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2). Dokl. Akad. Nauk SSSR , 269(3):543–547, 1983

  48. [48]

    O’Donoghue and E

    B. O’Donoghue and E. Cand` es. Adaptive restart for accelerated gradient schemes. Found. Comput. Math., 15(3):715–732, 2015

  49. [49]

    Paszke, S

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017. 29

  50. [50]

    J.-P. Penot. Calculus without derivatives, volume 266 of Graduate Texts in Mathematics. Springer, New York, 2013

  51. [51]

    Poliquin and R.T

    R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348:1805–1838, 1996

  52. [52]

    B.T. Poljak. Minimization of nonsmooth functionals. ˇZ. Vyˇ cisl. Mat. i Mat. Fiz. , 9:509–521, 1969

  53. [53]

    Polyak and A.B

    B.T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. , 30(4):838–855, 1992

  54. [54]

    Renegar and B

    J. Renegar and B. Grimmer. A simple nearly-optimal restart scheme for speeding-up first order methods. arXiv preprint arXiv:1803.00151 , 2018

  55. [55]

    Robbins and S

    H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics , 22:400–407, 1951

  56. [56]

    Rockafellar and R.J-B

    R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematis- chen Wissenschaften, Vol 317, Springer, Berlin, 1998

  57. [57]

    Roulet and A

    V. Roulet and A. d’Aspremont. Sharpness, restart and acceleration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems 30 , pages 1119–1129. Curran Associates, Inc., 2017

  58. [58]

    Schmidt, N

    M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017

  59. [59]

    Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

    M. Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 , 2013

  60. [60]

    Proximal Stochastic Dual Coordinate Ascent

    S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012

  61. [61]

    Shechtman, Y.C

    Y. Shechtman, Y.C. Eldar, O. Cohen, H.N. Chapman, J. Miao, and M. Segev. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, May 2015

  62. [62]

    N.Z. Shor. Minimization methods for non-differentiable functions , volume 3. Springer Science & Business Media, 2012

  63. [63]

    Tan and R

    Y.S. Tan and R. Vershynin. Phase retrieval via randomized kaczmarz: Theoretical guarantees. Information and Inference: A Journal of the IMA , 8(1):97–123, 2018

  64. [64]

    S.J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993

  65. [65]

    L. Xiao. Dual averaging methods for regularized stochastic learning and online opti- mization. Journal of Machine Learning Research , 11(Oct):2543–2596, 2010. 30

  66. [66]

    Y. Xu, Q. Lin, and T. Yang. Accelerated stochastic subgradient methods under local error bound condition. arXiv preprint arXiv:1607.01027 , 2016

  67. [67]

    Yang and Q

    T. Yang and Q. Lin. Rsg: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research , 19(1):236–268, 2018

  68. [68]

    Stagewise Training Accelerates Convergence of Testing Error Over SGD

    T. Yang, Y. Yan, Z. Yuan, and R. Jin. Why does stagewise training accelerate conver- gence of testing error over sgd? arXiv preprint arXiv:1812.03934 , 2018. A Proofs from Section 3.2 We will need the following elementary lemma. Lemma A.1. Suppose Assumptions (A4) and (A5) hold. Fix an arbitrary γ∈ (0, 2) and consider two points y∈T γ and x∈ Rd. Then the ...

  69. [69]

    (smoothness) The setS is a C2-smooth manifold around ¯x and the restriction f ⏐⏐ S is C2-smooth near ¯x

  70. [70]

    Let us first observe that under a very mild condition on the function f, identifiability at a critical points implies local sharp growth

    (finite identification) For any sequences ( xi,f (xi),vi) → (¯x,f (¯x), ¯v) with vi ∈ ∂Lf(xi), the points xi must all lie inS for all sufficiently large indices i. Let us first observe that under a very mild condition on the function f, identifiability at a critical points implies local sharp growth. Theorem D.2 (Identification implies sharpness) . Consider a cl...