Projected gradient methods for nonconvex and stochastic smooth optimization: new complexities and auto-conditioned stepsizes

Guanghui Lan; Tianjiao Li; Yangyang Xu

arxiv: 2412.14291 · v2 · pith:F35EYBTAnew · submitted 2024-12-18 · 🧮 math.OC · cs.LG· stat.ML

Projected gradient methods for nonconvex and stochastic smooth optimization: new complexities and auto-conditioned stepsizes

Guanghui Lan , Tianjiao Li , Yangyang Xu This is my paper

Pith reviewed 2026-05-23 06:21 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML

keywords projected gradient methodsnonconvex optimizationstochastic optimizationauto-conditioned stepsizesLipschitz constant estimationiteration complexitystationary pointsvariance reduction

0 comments

The pith

Auto-conditioned projected gradient methods achieve the same iteration complexity as constant-stepsize PG without needing the Lipschitz constant or line search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops projected gradient methods for minimizing smooth nonconvex functions over convex compact sets. It first gives a new analysis of the constant-stepsize projected gradient method that attains the best-known iteration complexity for finding an approximate stationary point. The main contribution is an auto-conditioned projected gradient variant that estimates the Lipschitz constant from first-order information in prior iterations and shows the underestimation error can be controlled to match the same complexity, without requiring the constant as input or any line search. The framework extends to stochastic projected gradient and variance-reduced variants, delivering new complexity bounds under different oracle models along with auto-conditioned stepsize policies that preserve comparable guarantees.

Core claim

By estimating the Lipschitz constant using first-order information gathered from the previous iterations, and showing that the error caused by underestimating the Lipschitz constant can be properly controlled, the auto-conditioned projected gradient method achieves the same iteration complexity as the constant-stepsize projected gradient method for finding an approximate stationary point of a smooth nonconvex function over a convex compact set, without requiring the input of the Lipschitz constant or any line search procedure. The methods are also generalized to the stochastic setting with new complexity bounds.

What carries the argument

The auto-conditioned projected gradient (AC-PG) variant, which estimates the Lipschitz constant on the fly from previous first-order information while controlling underestimation error.

If this is right

The constant-stepsize PG method attains the best-known iteration complexity for nonconvex smooth optimization over convex sets.
The AC-PG method matches this complexity without knowledge of the Lipschitz constant or use of line search.
Stochastic PG and variance-reduced SPG methods deliver new complexity bounds in different oracle settings.
Auto-conditioned stepsize policies for the stochastic variants achieve comparable convergence guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Implementation becomes simpler for problems where the Lipschitz constant is unknown or expensive to compute in advance.
The error-control technique for underestimation may transfer to adaptive stepsize rules in other first-order methods for nonconvex problems.
These policies could reduce the need for manual stepsize tuning across a range of optimization applications.

Load-bearing premise

The error caused by underestimating the Lipschitz constant can be properly controlled so that it does not degrade the overall convergence rate.

What would settle it

A concrete smooth nonconvex function over a convex set where the underestimation error from the auto-conditioned Lipschitz estimate accumulates and produces strictly worse than the claimed iteration complexity for reaching an epsilon-stationary point.

Figures

Figures reproduced from arXiv: 2412.14291 by Guanghui Lan, Tianjiao Li, Yangyang Xu.

**Figure 2.** Figure 2: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving 10 [PITH_FULL_IMAGE:figures/full_fig_p028_2.png] view at source ↗

**Figure 3.** Figure 3: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving [PITH_FULL_IMAGE:figures/full_fig_p029_3.png] view at source ↗

**Figure 4.** Figure 4: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving [PITH_FULL_IMAGE:figures/full_fig_p029_4.png] view at source ↗

**Figure 5.** Figure 5: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving [PITH_FULL_IMAGE:figures/full_fig_p030_5.png] view at source ↗

read the original abstract

We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the constant-stepsize PG method, achieving the best-known iteration complexity for finding an approximate stationary point of the problem. We then develop an "auto-conditioned" projected gradient (AC-PG) variant that achieves the same iteration complexity without requiring the input of the Lipschitz constant of the gradient or any line search procedure. The key idea is to estimate the Lipschitz constant using first-order information gathered from the previous iterations, and to show that the error caused by underestimating the Lipschitz constant can be properly controlled. We then generalize the PG methods to the stochastic setting, by proposing a stochastic projected gradient (SPG) method and a variance-reduced stochastic gradient (VR-SPG) method, achieving new complexity bounds in different oracle settings. We also present auto-conditioned stepsize policies for both stochastic PG methods and establish comparable convergence guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Auto-conditioned PG matches constant-step complexity without knowing L by controlling underestimation error from online estimates, with stochastic extensions claiming new oracle bounds.

read the letter

The paper's central result is that you can run projected gradient on nonconvex smooth problems without knowing the gradient's Lipschitz constant by estimating it from past steps and still keep the standard iteration complexity. The auto-conditioned version (AC-PG) does this by using first-order information from previous iterations and bounding the effect of any underestimation. They begin with a fresh analysis of constant-stepsize PG that reaches the best-known rate for finding approximate stationary points. Building on that, the AC-PG avoids both manual input of the constant and line searches. The stochastic side introduces SPG and VR-SPG with similar auto-conditioned stepsizes and reports new oracle complexities for those settings. This approach is practical because many real problems do not come with an easy way to compute or bound the Lipschitz constant. If the error control holds up in the proofs, it removes a common barrier to using these methods. The work stays within the standard nonconvex analysis framework, which makes the claims easier to evaluate. One area to check is how the stochastic complexity bounds compare to prior results on stochastic projected gradient methods. The abstract calls them new, but the exact improvement depends on the oracle model and variance assumptions. The compactness assumption on the constraint set is standard and not a problem here. This paper is for people who work on first-order methods for nonconvex optimization, particularly those interested in making stepsize selection more automatic. It should be useful for both theory and implementation in machine learning contexts. The paper shows clear engagement with the literature and specific claims, so it deserves a serious referee to verify the derivations and assess the stochastic contributions.

Referee Report

2 major / 2 minor

Summary. The paper provides a new analysis of constant-stepsize projected gradient (PG) methods for smooth nonconvex minimization over convex compact sets, claiming best-known O(1/ε²) iteration complexity for approximate stationarity. It introduces an auto-conditioned PG (AC-PG) variant that estimates the gradient Lipschitz constant L from prior first-order information and controls underestimation error to match the same complexity without inputting L or using line search. Stochastic extensions (SPG, VR-SPG) and their auto-conditioned versions are analyzed with new oracle complexities under different settings.

Significance. If the error-control argument for underestimation holds, the auto-conditioned stepsize removes a practical barrier to applying PG methods while preserving optimal rates, which is a meaningful contribution for implementation. The stochastic complexity results, if tighter than prior work, would also be useful. No machine-checked proofs or reproducible code are mentioned, but the parameter-free derivation via auto-conditioning is a clear strength if rigorously established.

major comments (2)

[§3] Abstract and §3 (AC-PG construction): the claim that underestimation error from the Lipschitz estimate can be controlled without degrading the O(1/ε²) rate is the central technical step, yet the specific descent inequality or potential-function argument that absorbs the error term while preserving the complexity is not visible in the provided description; this must be made explicit with the precise bound on the accumulated error.
[§2] Theorem on constant-stepsize PG (likely §2): the 'best-known' iteration complexity is asserted, but the precise dependence on the diameter of the constraint set, smoothness constant, and target accuracy ε should be compared explicitly to the prior reference (e.g., the exact big-O expression) to confirm improvement or equivalence.

minor comments (2)

[§3] Notation for the estimated Lipschitz constant (e.g., L_k) should be introduced with a clear recursive definition and boundedness argument early in the AC-PG section.
[§5] The stochastic sections would benefit from a table summarizing the oracle complexities (gradient evaluations, variance-reduced calls) across SPG, VR-SPG, and their auto-conditioned variants for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive suggestions. We address the two major comments below and will revise the manuscript to improve clarity on the technical arguments and comparisons.

read point-by-point responses

Referee: [§3] Abstract and §3 (AC-PG construction): the claim that underestimation error from the Lipschitz estimate can be controlled without degrading the O(1/ε²) rate is the central technical step, yet the specific descent inequality or potential-function argument that absorbs the error term while preserving the complexity is not visible in the provided description; this must be made explicit with the precise bound on the accumulated error.

Authors: We agree that the error-control argument requires explicit highlighting. The full analysis in Section 3 derives a descent inequality for the AC-PG iterates by using the underestimated stepsize η_k = 1/L_k, where L_k is the running estimate. The resulting inequality includes an additive error term bounded by the sum of ||g_{t} - g_{t-1}||² over prior steps; this sum is controlled via a telescoping argument that absorbs into the potential function without changing the leading O(1/ε²) term. We will add a dedicated lemma stating the precise accumulated-error bound (including all constants) immediately before the main complexity theorem. revision: yes
Referee: [§2] Theorem on constant-stepsize PG (likely §2): the 'best-known' iteration complexity is asserted, but the precise dependence on the diameter of the constraint set, smoothness constant, and target accuracy ε should be compared explicitly to the prior reference (e.g., the exact big-O expression) to confirm improvement or equivalence.

Authors: We will revise the theorem statement and surrounding discussion in Section 2 to include the explicit comparison. Our bound is O(D² L / ε²) iterations to reach an ε-stationary point, where D denotes the diameter of the feasible set. This matches the leading term in the best prior results for projected gradient methods on nonconvex problems (e.g., the O(D² L / ε²) rate obtained via standard analysis in Nesterov’s work and subsequent papers on constrained nonconvex optimization). We will insert a short table or paragraph listing the exact big-O expressions from the cited references. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is a new analysis of constant-stepsize projected gradient methods achieving O(1/ε²) complexity for nonconvex problems, followed by an AC-PG variant that estimates the gradient Lipschitz constant from prior first-order information and proves that underestimation error can be controlled without degrading the rate. This is presented as a direct extension of standard PG analysis with explicit error bounds, not as a fit or self-referential definition. No load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided abstract or claim structure; the derivation chain remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on standard domain assumptions of smoothness and set convexity that are common in the literature; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption The objective function is smooth (gradient is Lipschitz continuous).
Invoked throughout to enable gradient-based methods and Lipschitz estimation.
domain assumption The feasible set is convex and compact.
Required for the projection operator to be well-defined and for boundedness arguments.

pith-pipeline@v0.9.0 · 5711 in / 1401 out tokens · 31960 ms · 2026-05-23T06:21:33.149191+00:00 · methodology

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stochastic Auto-conditioned Fast Gradient Methods with Optimal Rates
math.OC 2026-04 unverdicted novelty 8.0

Stochastic AC-FGM achieves optimal O(1/√ε) iteration complexity and O(1/ε²) sample complexity while being fully adaptive to smoothness, horizon, and noise under bounded conditional variance.
Adaptive Newton-CG methods with global and local analysis for unconstrained optimization with H\"older continuous Hessian
math.OC 2026-04 unverdicted novelty 7.0

Adaptive Newton-CG methods achieve the best-known iteration complexity for epsilon-stationary points in nonconvex optimization with Holder continuous Hessians while ensuring local superlinear convergence.
Universal and Parameter-free Gradient Sliding for Composite Optimization
math.OC 2026-03 unverdicted novelty 7.0

PFUGS is the first parameter-free gradient sliding method for composite convex problems with unknown Hölder and Lipschitz constants, using O((M_ν/ε)^{2/(1+3ν)}) subgradient evaluations of f and O((L/ε)^{1/2}) gradient...
Auto-Conditioned Frank-Wolfe Algorithms
math.OC 2026-05 unverdicted novelty 5.0

Auto-conditioned Frank-Wolfe methods use local Lipschitz estimators from first-order information to achieve convergence guarantees in convex and nonconvex settings without prior global smoothness knowledge.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 4 Pith papers · 2 internal anchors

[1]

≥ γ(i) 2

1 γ(i) t ∥g(i) X,t−1∥2 − L(i) 1 2 ∥x(i) 1 − x(i) 0 ∥2 (ii) ≥ (si−1)(si−2) 8γ(i) si min t∈[1,si−1] ∥g(i) X,t∥2 − L(i) 1 2 ∥x(i) 1 − x(i) 0 ∥2, (2.26) where step (i) follows from the definition of projected gradient in (2.2) and the fact γ(i) 1 ≥ 0, and step (ii) is due to γ(i) si ≥ γ(i) si−1 ≥ ... ≥ γ(i) 2 . Then by summing up Si for i = 1, ..., m, we obta...

work page
[2]

price of adaptivity

− Lt)2] ≤ E[ 1 b′ t (Lt(¯ξt 1))2] ≤ ¯L2 b′ t . (3.26) Then by using Bernstein’s inequality and | ¯Lt − Lt| ≤ ˆL + L almost surely, we have that for any λ > 0, P ¯Lt − Lt ≥ ( ˆL+L)λ 3b′ t + q 2 ¯L2λ b′ t ≤ exp(−λ), (3.27) Projected gradient methods for nonconvex and stochastic optimization 13 which indicates (3.23). In AC-SPG, the output at time step k is ...

work page
[3]

auto-conditioned

Therefore, while this approach may potentially increase a few higher-order terms in sample complexity, it allows us to achieve the iteration complexity of O L2D2 X log δ−1 ϵ2 after ignoring some log-log factors. Nevertheless, all the sample complexities above still exhibit polynomial dependence on δ. To reduce this dependence to a logarithmic factor, we i...

work page
[4]

price of adaptivity

¯L0 + 3kσ2 8 ¯L0N ≤ E X 1≤t≤k,I(t)=0 γt 4 ∥xt − xt−1∥2 + 3 ˆL2D2 X ¯L0 + 3kσ2 8 ¯L0N , (4.38) where the step (i) follows from that for two consecutive I(t) = 1 points, denoted as t1 < t 2, we have ˜Lt2 > 2√ 3 Lt1. Finally, by combining Ineqs. (4.37) and (4.38), and invoking the output rule (4.30), we obtain E hPk t=1 1 4γt ∥gX,R(k)∥2 i = E hPk t=1 1 4γt ∥...

work page
[5]

Andrad´ ottir

S. Andrad´ ottir. A review of simulation optimization techniques. In 1998 winter simulation conference. Proceedings (Cat. No. 98CH36274) , volume 1, pages 151–158. IEEE, 1998

work page 1998
[6]

Arjevani, Y

Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023

work page 2023
[7]

L. Armijo. Minimization of functions having lipschitz continuous first partial derivatives. Pacific Journal of mathe- matics, 16(1):1–3, 1966

work page 1966
[8]

Carmon and O

Y. Carmon and O. Hinder. Making sgd parameter-free. In Conference on Learning Theory, pages 2360–2389. PMLR, 2022

work page 2022
[9]

Cartis, N

C. Cartis, N. I. Gould, and P. L. Toint. On the complexity of steepest descent, newton’s and regularized newton’s methods for nonconvex unconstrained optimization problems. SIAM journal on optimization , 20(6):2833–2852, 2010

work page 2010
[10]

Cutkosky and F

A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019

work page 2019
[11]

Davis and D

D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

work page 2019
[12]

Davis and B

D. Davis and B. Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization , 29(3):1908–1930, 2019

work page 1908
[13]

C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing systems , 31, 2018

work page 2018
[14]

M. C. Fu. Optimization for simulation: Theory vs. practice. INFORMS Journal on Computing , 14(3):192–215, 2002

work page 2002
[15]

Ghadimi and G

S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization , 23(4):2341–2368, 2013. Projected gradient methods for nonconvex and stochastic optimization 31

work page 2013
[16]

Ghadimi and G

S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathe- matical Programming, 156(1):59–99, 2016

work page 2016
[17]

Ghadimi, G

S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for constrained nonconvex stochastic programming. Mathematical Programming, 155:267–305, 2016

work page 2016
[18]

Ghadimi, G

S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016

work page 2016
[19]

Hare and C

W. Hare and C. Sagastiz´ abal. Computing proximal points of nonconvex functions. Mathematical Programming, 116(1):221–258, 2009

work page 2009
[20]

M. Ivgi, O. Hinder, and Y. Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023

work page arXiv 2023
[21]

G. Lan. First-order and stochastic optimization methods for machine learning , volume 1. Springer, 2020

work page 2020
[22]

Lan and T

G. Lan and T. Li. Auto-conditioned primal-dual hybrid gradient method and alternating direction method of multipliers. arXiv preprint arXiv:2410.01979 , 2024

work page arXiv 2024
[23]

Lan and R

G. Lan and R. Monteiro. Private communication. Personal communication, 2018. Unpublished

work page 2018
[24]

G. Lan, Y. Ouyang, and Z. Zhang. Optimal and parameter-free gradient minimization methods for convex and non- convex optimization. arXiv preprint arXiv:2310.12139 , 2023

work page arXiv 2023
[25]

Latafat, A

P. Latafat, A. Themelis, L. Stella, and P. Patrinos. Adaptive proximal algorithms for convex optimization under local lipschitz continuity of the gradient. arXiv preprint arXiv:2301.04431 , 2023

work page arXiv 2023
[26]

Li and G

T. Li and G. Lan. A simple uniformly optimal method without line search for convex optimization. arXiv preprint arXiv:2310.10082, 2023

work page arXiv 2023
[27]

Liang and R

J. Liang and R. D. Monteiro. An average curvature accelerated composite gradient method for nonconvex smooth composite optimization problems. SIAM Journal on Optimization , 31(1):217–243, 2021

work page 2021
[28]

Liang and R

J. Liang and R. D. Monteiro. Average curvature fista for nonconvex smooth composite optimization problems. Com- putational Optimization and Applications , 86(1):275–302, 2023

work page 2023
[29]

B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 806–814, 2015

work page 2015
[30]

Mairal, F

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning , pages 689–696, 2009

work page 2009
[31]

Malitsky and K

Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. arXiv preprint arXiv:1910.09529 , 2019

work page arXiv 1910
[32]

Mancino-Ball, S

G. Mancino-Ball, S. Miao, Y. Xu, and J. Chen. Proximal stochastic recursive momentum methods for nonconvex composite decentralized optimization. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 37, pages 9055–9063, 2023

work page 2023
[33]

Mason, J

L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent.Advances in neural information processing systems, 12, 1999

work page 1999
[34]

Nesterov

Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). In Doklady an ussr , volume 269, pages 543–547, 1983

work page 1983
[35]

Y. E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course . Kluwer Academic Publishers, Mas- sachusetts, 2004

work page 2004
[36]

Orabona and D

F. Orabona and D. P´ al. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016

work page 2016
[37]

M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming . John Wiley & Sons, 2014

work page 2014
[38]

R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization , 14(5):877–898, 1976

work page 1976
[39]

J. V. Shi, Y. Xu, and R. G. Baraniuk. Sparse bilinear logistic regression. arXiv preprint arXiv:1404.4104 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[40]

R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018

work page 2018
[41]

Tran-Dinh, N

Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen. A hybrid stochastic optimization framework for composite nonconvex optimization. Mathematical Programming, 191(2):1005–1071, 2022

work page 2022
[42]

Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems , 32, 2019

work page 2019
[43]

Xu and Y

Y. Xu and Y. Xu. Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization. Journal of Optimization Theory and Applications , 196(1):266–297, 2023

work page 2023
[44]

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

D. Zhou, S. Ma, and J. Yang. Adabb: Adaptive barzilai-borwein method for convex optimization. arXiv preprint arXiv:2401.08024, 2024

work page arXiv 2024

[1] [1]

≥ γ(i) 2

1 γ(i) t ∥g(i) X,t−1∥2 − L(i) 1 2 ∥x(i) 1 − x(i) 0 ∥2 (ii) ≥ (si−1)(si−2) 8γ(i) si min t∈[1,si−1] ∥g(i) X,t∥2 − L(i) 1 2 ∥x(i) 1 − x(i) 0 ∥2, (2.26) where step (i) follows from the definition of projected gradient in (2.2) and the fact γ(i) 1 ≥ 0, and step (ii) is due to γ(i) si ≥ γ(i) si−1 ≥ ... ≥ γ(i) 2 . Then by summing up Si for i = 1, ..., m, we obta...

work page

[2] [2]

price of adaptivity

− Lt)2] ≤ E[ 1 b′ t (Lt(¯ξt 1))2] ≤ ¯L2 b′ t . (3.26) Then by using Bernstein’s inequality and | ¯Lt − Lt| ≤ ˆL + L almost surely, we have that for any λ > 0, P ¯Lt − Lt ≥ ( ˆL+L)λ 3b′ t + q 2 ¯L2λ b′ t ≤ exp(−λ), (3.27) Projected gradient methods for nonconvex and stochastic optimization 13 which indicates (3.23). In AC-SPG, the output at time step k is ...

work page

[3] [3]

auto-conditioned

Therefore, while this approach may potentially increase a few higher-order terms in sample complexity, it allows us to achieve the iteration complexity of O L2D2 X log δ−1 ϵ2 after ignoring some log-log factors. Nevertheless, all the sample complexities above still exhibit polynomial dependence on δ. To reduce this dependence to a logarithmic factor, we i...

work page

[4] [4]

price of adaptivity

¯L0 + 3kσ2 8 ¯L0N ≤ E X 1≤t≤k,I(t)=0 γt 4 ∥xt − xt−1∥2 + 3 ˆL2D2 X ¯L0 + 3kσ2 8 ¯L0N , (4.38) where the step (i) follows from that for two consecutive I(t) = 1 points, denoted as t1 < t 2, we have ˜Lt2 > 2√ 3 Lt1. Finally, by combining Ineqs. (4.37) and (4.38), and invoking the output rule (4.30), we obtain E hPk t=1 1 4γt ∥gX,R(k)∥2 i = E hPk t=1 1 4γt ∥...

work page

[5] [5]

Andrad´ ottir

S. Andrad´ ottir. A review of simulation optimization techniques. In 1998 winter simulation conference. Proceedings (Cat. No. 98CH36274) , volume 1, pages 151–158. IEEE, 1998

work page 1998

[6] [6]

Arjevani, Y

Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023

work page 2023

[7] [7]

L. Armijo. Minimization of functions having lipschitz continuous first partial derivatives. Pacific Journal of mathe- matics, 16(1):1–3, 1966

work page 1966

[8] [8]

Carmon and O

Y. Carmon and O. Hinder. Making sgd parameter-free. In Conference on Learning Theory, pages 2360–2389. PMLR, 2022

work page 2022

[9] [9]

Cartis, N

C. Cartis, N. I. Gould, and P. L. Toint. On the complexity of steepest descent, newton’s and regularized newton’s methods for nonconvex unconstrained optimization problems. SIAM journal on optimization , 20(6):2833–2852, 2010

work page 2010

[10] [10]

Cutkosky and F

A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019

work page 2019

[11] [11]

Davis and D

D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

work page 2019

[12] [12]

Davis and B

D. Davis and B. Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization , 29(3):1908–1930, 2019

work page 1908

[13] [13]

C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing systems , 31, 2018

work page 2018

[14] [14]

M. C. Fu. Optimization for simulation: Theory vs. practice. INFORMS Journal on Computing , 14(3):192–215, 2002

work page 2002

[15] [15]

Ghadimi and G

S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization , 23(4):2341–2368, 2013. Projected gradient methods for nonconvex and stochastic optimization 31

work page 2013

[16] [16]

Ghadimi and G

S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathe- matical Programming, 156(1):59–99, 2016

work page 2016

[17] [17]

Ghadimi, G

S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for constrained nonconvex stochastic programming. Mathematical Programming, 155:267–305, 2016

work page 2016

[18] [18]

Ghadimi, G

S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016

work page 2016

[19] [19]

Hare and C

W. Hare and C. Sagastiz´ abal. Computing proximal points of nonconvex functions. Mathematical Programming, 116(1):221–258, 2009

work page 2009

[20] [20]

M. Ivgi, O. Hinder, and Y. Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023

work page arXiv 2023

[21] [21]

G. Lan. First-order and stochastic optimization methods for machine learning , volume 1. Springer, 2020

work page 2020

[22] [22]

Lan and T

G. Lan and T. Li. Auto-conditioned primal-dual hybrid gradient method and alternating direction method of multipliers. arXiv preprint arXiv:2410.01979 , 2024

work page arXiv 2024

[23] [23]

Lan and R

G. Lan and R. Monteiro. Private communication. Personal communication, 2018. Unpublished

work page 2018

[24] [24]

G. Lan, Y. Ouyang, and Z. Zhang. Optimal and parameter-free gradient minimization methods for convex and non- convex optimization. arXiv preprint arXiv:2310.12139 , 2023

work page arXiv 2023

[25] [25]

Latafat, A

P. Latafat, A. Themelis, L. Stella, and P. Patrinos. Adaptive proximal algorithms for convex optimization under local lipschitz continuity of the gradient. arXiv preprint arXiv:2301.04431 , 2023

work page arXiv 2023

[26] [26]

Li and G

T. Li and G. Lan. A simple uniformly optimal method without line search for convex optimization. arXiv preprint arXiv:2310.10082, 2023

work page arXiv 2023

[27] [27]

Liang and R

J. Liang and R. D. Monteiro. An average curvature accelerated composite gradient method for nonconvex smooth composite optimization problems. SIAM Journal on Optimization , 31(1):217–243, 2021

work page 2021

[28] [28]

Liang and R

J. Liang and R. D. Monteiro. Average curvature fista for nonconvex smooth composite optimization problems. Com- putational Optimization and Applications , 86(1):275–302, 2023

work page 2023

[29] [29]

B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 806–814, 2015

work page 2015

[30] [30]

Mairal, F

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning , pages 689–696, 2009

work page 2009

[31] [31]

Malitsky and K

Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. arXiv preprint arXiv:1910.09529 , 2019

work page arXiv 1910

[32] [32]

Mancino-Ball, S

G. Mancino-Ball, S. Miao, Y. Xu, and J. Chen. Proximal stochastic recursive momentum methods for nonconvex composite decentralized optimization. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 37, pages 9055–9063, 2023

work page 2023

[33] [33]

Mason, J

L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent.Advances in neural information processing systems, 12, 1999

work page 1999

[34] [34]

Nesterov

Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). In Doklady an ussr , volume 269, pages 543–547, 1983

work page 1983

[35] [35]

Y. E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course . Kluwer Academic Publishers, Mas- sachusetts, 2004

work page 2004

[36] [36]

Orabona and D

F. Orabona and D. P´ al. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016

work page 2016

[37] [37]

M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming . John Wiley & Sons, 2014

work page 2014

[38] [38]

R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization , 14(5):877–898, 1976

work page 1976

[39] [39]

J. V. Shi, Y. Xu, and R. G. Baraniuk. Sparse bilinear logistic regression. arXiv preprint arXiv:1404.4104 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[40] [40]

R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018

work page 2018

[41] [41]

Tran-Dinh, N

Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen. A hybrid stochastic optimization framework for composite nonconvex optimization. Mathematical Programming, 191(2):1005–1071, 2022

work page 2022

[42] [42]

Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems , 32, 2019

work page 2019

[43] [43]

Xu and Y

Y. Xu and Y. Xu. Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization. Journal of Optimization Theory and Applications , 196(1):266–297, 2023

work page 2023

[44] [44]

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

D. Zhou, S. Ma, and J. Yang. Adabb: Adaptive barzilai-borwein method for convex optimization. arXiv preprint arXiv:2401.08024, 2024

work page arXiv 2024