pith. sign in

arxiv: 2412.14291 · v2 · pith:F35EYBTAnew · submitted 2024-12-18 · 🧮 math.OC · cs.LG· stat.ML

Projected gradient methods for nonconvex and stochastic smooth optimization: new complexities and auto-conditioned stepsizes

Pith reviewed 2026-05-23 06:21 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords projected gradient methodsnonconvex optimizationstochastic optimizationauto-conditioned stepsizesLipschitz constant estimationiteration complexitystationary pointsvariance reduction
0
0 comments X

The pith

Auto-conditioned projected gradient methods achieve the same iteration complexity as constant-stepsize PG without needing the Lipschitz constant or line search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops projected gradient methods for minimizing smooth nonconvex functions over convex compact sets. It first gives a new analysis of the constant-stepsize projected gradient method that attains the best-known iteration complexity for finding an approximate stationary point. The main contribution is an auto-conditioned projected gradient variant that estimates the Lipschitz constant from first-order information in prior iterations and shows the underestimation error can be controlled to match the same complexity, without requiring the constant as input or any line search. The framework extends to stochastic projected gradient and variance-reduced variants, delivering new complexity bounds under different oracle models along with auto-conditioned stepsize policies that preserve comparable guarantees.

Core claim

By estimating the Lipschitz constant using first-order information gathered from the previous iterations, and showing that the error caused by underestimating the Lipschitz constant can be properly controlled, the auto-conditioned projected gradient method achieves the same iteration complexity as the constant-stepsize projected gradient method for finding an approximate stationary point of a smooth nonconvex function over a convex compact set, without requiring the input of the Lipschitz constant or any line search procedure. The methods are also generalized to the stochastic setting with new complexity bounds.

What carries the argument

The auto-conditioned projected gradient (AC-PG) variant, which estimates the Lipschitz constant on the fly from previous first-order information while controlling underestimation error.

If this is right

  • The constant-stepsize PG method attains the best-known iteration complexity for nonconvex smooth optimization over convex sets.
  • The AC-PG method matches this complexity without knowledge of the Lipschitz constant or use of line search.
  • Stochastic PG and variance-reduced SPG methods deliver new complexity bounds in different oracle settings.
  • Auto-conditioned stepsize policies for the stochastic variants achieve comparable convergence guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Implementation becomes simpler for problems where the Lipschitz constant is unknown or expensive to compute in advance.
  • The error-control technique for underestimation may transfer to adaptive stepsize rules in other first-order methods for nonconvex problems.
  • These policies could reduce the need for manual stepsize tuning across a range of optimization applications.

Load-bearing premise

The error caused by underestimating the Lipschitz constant can be properly controlled so that it does not degrade the overall convergence rate.

What would settle it

A concrete smooth nonconvex function over a convex set where the underestimation error from the auto-conditioned Lipschitz estimate accumulates and produces strictly worse than the claimed iteration complexity for reaching an epsilon-stationary point.

Figures

Figures reproduced from arXiv: 2412.14291 by Guanghui Lan, Tianjiao Li, Yangyang Xu.

Figure 1
Figure 1. Figure 1: Average and standard deviation results by the PG and AC-PG methods on solving 10 independently [PITH_FULL_IMAGE:figures/full_fig_p027_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving 10 [PITH_FULL_IMAGE:figures/full_fig_p028_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving [PITH_FULL_IMAGE:figures/full_fig_p029_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving [PITH_FULL_IMAGE:figures/full_fig_p029_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average and standard deviation results by SPG, VR-SPG, AC-SPG, and AC-VR-SPG on solving [PITH_FULL_IMAGE:figures/full_fig_p030_5.png] view at source ↗
read the original abstract

We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the constant-stepsize PG method, achieving the best-known iteration complexity for finding an approximate stationary point of the problem. We then develop an "auto-conditioned" projected gradient (AC-PG) variant that achieves the same iteration complexity without requiring the input of the Lipschitz constant of the gradient or any line search procedure. The key idea is to estimate the Lipschitz constant using first-order information gathered from the previous iterations, and to show that the error caused by underestimating the Lipschitz constant can be properly controlled. We then generalize the PG methods to the stochastic setting, by proposing a stochastic projected gradient (SPG) method and a variance-reduced stochastic gradient (VR-SPG) method, achieving new complexity bounds in different oracle settings. We also present auto-conditioned stepsize policies for both stochastic PG methods and establish comparable convergence guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper provides a new analysis of constant-stepsize projected gradient (PG) methods for smooth nonconvex minimization over convex compact sets, claiming best-known O(1/ε²) iteration complexity for approximate stationarity. It introduces an auto-conditioned PG (AC-PG) variant that estimates the gradient Lipschitz constant L from prior first-order information and controls underestimation error to match the same complexity without inputting L or using line search. Stochastic extensions (SPG, VR-SPG) and their auto-conditioned versions are analyzed with new oracle complexities under different settings.

Significance. If the error-control argument for underestimation holds, the auto-conditioned stepsize removes a practical barrier to applying PG methods while preserving optimal rates, which is a meaningful contribution for implementation. The stochastic complexity results, if tighter than prior work, would also be useful. No machine-checked proofs or reproducible code are mentioned, but the parameter-free derivation via auto-conditioning is a clear strength if rigorously established.

major comments (2)
  1. [§3] Abstract and §3 (AC-PG construction): the claim that underestimation error from the Lipschitz estimate can be controlled without degrading the O(1/ε²) rate is the central technical step, yet the specific descent inequality or potential-function argument that absorbs the error term while preserving the complexity is not visible in the provided description; this must be made explicit with the precise bound on the accumulated error.
  2. [§2] Theorem on constant-stepsize PG (likely §2): the 'best-known' iteration complexity is asserted, but the precise dependence on the diameter of the constraint set, smoothness constant, and target accuracy ε should be compared explicitly to the prior reference (e.g., the exact big-O expression) to confirm improvement or equivalence.
minor comments (2)
  1. [§3] Notation for the estimated Lipschitz constant (e.g., L_k) should be introduced with a clear recursive definition and boundedness argument early in the AC-PG section.
  2. [§5] The stochastic sections would benefit from a table summarizing the oracle complexities (gradient evaluations, variance-reduced calls) across SPG, VR-SPG, and their auto-conditioned variants for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive suggestions. We address the two major comments below and will revise the manuscript to improve clarity on the technical arguments and comparisons.

read point-by-point responses
  1. Referee: [§3] Abstract and §3 (AC-PG construction): the claim that underestimation error from the Lipschitz estimate can be controlled without degrading the O(1/ε²) rate is the central technical step, yet the specific descent inequality or potential-function argument that absorbs the error term while preserving the complexity is not visible in the provided description; this must be made explicit with the precise bound on the accumulated error.

    Authors: We agree that the error-control argument requires explicit highlighting. The full analysis in Section 3 derives a descent inequality for the AC-PG iterates by using the underestimated stepsize η_k = 1/L_k, where L_k is the running estimate. The resulting inequality includes an additive error term bounded by the sum of ||g_{t} - g_{t-1}||² over prior steps; this sum is controlled via a telescoping argument that absorbs into the potential function without changing the leading O(1/ε²) term. We will add a dedicated lemma stating the precise accumulated-error bound (including all constants) immediately before the main complexity theorem. revision: yes

  2. Referee: [§2] Theorem on constant-stepsize PG (likely §2): the 'best-known' iteration complexity is asserted, but the precise dependence on the diameter of the constraint set, smoothness constant, and target accuracy ε should be compared explicitly to the prior reference (e.g., the exact big-O expression) to confirm improvement or equivalence.

    Authors: We will revise the theorem statement and surrounding discussion in Section 2 to include the explicit comparison. Our bound is O(D² L / ε²) iterations to reach an ε-stationary point, where D denotes the diameter of the feasible set. This matches the leading term in the best prior results for projected gradient methods on nonconvex problems (e.g., the O(D² L / ε²) rate obtained via standard analysis in Nesterov’s work and subsequent papers on constrained nonconvex optimization). We will insert a short table or paragraph listing the exact big-O expressions from the cited references. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is a new analysis of constant-stepsize projected gradient methods achieving O(1/ε²) complexity for nonconvex problems, followed by an AC-PG variant that estimates the gradient Lipschitz constant from prior first-order information and proves that underestimation error can be controlled without degrading the rate. This is presented as a direct extension of standard PG analysis with explicit error bounds, not as a fit or self-referential definition. No load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided abstract or claim structure; the derivation chain remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on standard domain assumptions of smoothness and set convexity that are common in the literature; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The objective function is smooth (gradient is Lipschitz continuous).
    Invoked throughout to enable gradient-based methods and Lipschitz estimation.
  • domain assumption The feasible set is convex and compact.
    Required for the projection operator to be well-defined and for boundedness arguments.

pith-pipeline@v0.9.0 · 5711 in / 1401 out tokens · 31960 ms · 2026-05-23T06:21:33.149191+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stochastic Auto-conditioned Fast Gradient Methods with Optimal Rates

    math.OC 2026-04 unverdicted novelty 8.0

    Stochastic AC-FGM achieves optimal O(1/√ε) iteration complexity and O(1/ε²) sample complexity while being fully adaptive to smoothness, horizon, and noise under bounded conditional variance.

  2. Adaptive Newton-CG methods with global and local analysis for unconstrained optimization with H\"older continuous Hessian

    math.OC 2026-04 unverdicted novelty 7.0

    Adaptive Newton-CG methods achieve the best-known iteration complexity for epsilon-stationary points in nonconvex optimization with Holder continuous Hessians while ensuring local superlinear convergence.

  3. Universal and Parameter-free Gradient Sliding for Composite Optimization

    math.OC 2026-03 unverdicted novelty 7.0

    PFUGS is the first parameter-free gradient sliding method for composite convex problems with unknown Hölder and Lipschitz constants, using O((M_ν/ε)^{2/(1+3ν)}) subgradient evaluations of f and O((L/ε)^{1/2}) gradient...

  4. Auto-Conditioned Frank-Wolfe Algorithms

    math.OC 2026-05 unverdicted novelty 5.0

    Auto-conditioned Frank-Wolfe methods use local Lipschitz estimators from first-order information to achieve convergence guarantees in convex and nonconvex settings without prior global smoothness knowledge.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 4 Pith papers · 2 internal anchors

  1. [1]

    ≥ γ(i) 2

    1 γ(i) t ∥g(i) X,t−1∥2 − L(i) 1 2 ∥x(i) 1 − x(i) 0 ∥2 (ii) ≥ (si−1)(si−2) 8γ(i) si min t∈[1,si−1] ∥g(i) X,t∥2 − L(i) 1 2 ∥x(i) 1 − x(i) 0 ∥2, (2.26) where step (i) follows from the definition of projected gradient in (2.2) and the fact γ(i) 1 ≥ 0, and step (ii) is due to γ(i) si ≥ γ(i) si−1 ≥ ... ≥ γ(i) 2 . Then by summing up Si for i = 1, ..., m, we obta...

  2. [2]

    price of adaptivity

    − Lt)2] ≤ E[ 1 b′ t (Lt(¯ξt 1))2] ≤ ¯L2 b′ t . (3.26) Then by using Bernstein’s inequality and | ¯Lt − Lt| ≤ ˆL + L almost surely, we have that for any λ > 0, P ¯Lt − Lt ≥ ( ˆL+L)λ 3b′ t + q 2 ¯L2λ b′ t ≤ exp(−λ), (3.27) Projected gradient methods for nonconvex and stochastic optimization 13 which indicates (3.23). In AC-SPG, the output at time step k is ...

  3. [3]

    auto-conditioned

    Therefore, while this approach may potentially increase a few higher-order terms in sample complexity, it allows us to achieve the iteration complexity of O L2D2 X log δ−1 ϵ2 after ignoring some log-log factors. Nevertheless, all the sample complexities above still exhibit polynomial dependence on δ. To reduce this dependence to a logarithmic factor, we i...

  4. [4]

    price of adaptivity

    ¯L0 + 3kσ2 8 ¯L0N ≤ E X 1≤t≤k,I(t)=0 γt 4 ∥xt − xt−1∥2 + 3 ˆL2D2 X ¯L0 + 3kσ2 8 ¯L0N , (4.38) where the step (i) follows from that for two consecutive I(t) = 1 points, denoted as t1 < t 2, we have ˜Lt2 > 2√ 3 Lt1. Finally, by combining Ineqs. (4.37) and (4.38), and invoking the output rule (4.30), we obtain E hPk t=1 1 4γt ∥gX,R(k)∥2 i = E hPk t=1 1 4γt ∥...

  5. [5]

    Andrad´ ottir

    S. Andrad´ ottir. A review of simulation optimization techniques. In 1998 winter simulation conference. Proceedings (Cat. No. 98CH36274) , volume 1, pages 151–158. IEEE, 1998

  6. [6]

    Arjevani, Y

    Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023

  7. [7]

    L. Armijo. Minimization of functions having lipschitz continuous first partial derivatives. Pacific Journal of mathe- matics, 16(1):1–3, 1966

  8. [8]

    Carmon and O

    Y. Carmon and O. Hinder. Making sgd parameter-free. In Conference on Learning Theory, pages 2360–2389. PMLR, 2022

  9. [9]

    Cartis, N

    C. Cartis, N. I. Gould, and P. L. Toint. On the complexity of steepest descent, newton’s and regularized newton’s methods for nonconvex unconstrained optimization problems. SIAM journal on optimization , 20(6):2833–2852, 2010

  10. [10]

    Cutkosky and F

    A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019

  11. [11]

    Davis and D

    D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

  12. [12]

    Davis and B

    D. Davis and B. Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization , 29(3):1908–1930, 2019

  13. [13]

    C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing systems , 31, 2018

  14. [14]

    M. C. Fu. Optimization for simulation: Theory vs. practice. INFORMS Journal on Computing , 14(3):192–215, 2002

  15. [15]

    Ghadimi and G

    S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization , 23(4):2341–2368, 2013. Projected gradient methods for nonconvex and stochastic optimization 31

  16. [16]

    Ghadimi and G

    S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathe- matical Programming, 156(1):59–99, 2016

  17. [17]

    Ghadimi, G

    S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for constrained nonconvex stochastic programming. Mathematical Programming, 155:267–305, 2016

  18. [18]

    Ghadimi, G

    S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016

  19. [19]

    Hare and C

    W. Hare and C. Sagastiz´ abal. Computing proximal points of nonconvex functions. Mathematical Programming, 116(1):221–258, 2009

  20. [20]

    M. Ivgi, O. Hinder, and Y. Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023

  21. [21]

    G. Lan. First-order and stochastic optimization methods for machine learning , volume 1. Springer, 2020

  22. [22]

    Lan and T

    G. Lan and T. Li. Auto-conditioned primal-dual hybrid gradient method and alternating direction method of multipliers. arXiv preprint arXiv:2410.01979 , 2024

  23. [23]

    Lan and R

    G. Lan and R. Monteiro. Private communication. Personal communication, 2018. Unpublished

  24. [24]

    G. Lan, Y. Ouyang, and Z. Zhang. Optimal and parameter-free gradient minimization methods for convex and non- convex optimization. arXiv preprint arXiv:2310.12139 , 2023

  25. [25]

    Latafat, A

    P. Latafat, A. Themelis, L. Stella, and P. Patrinos. Adaptive proximal algorithms for convex optimization under local lipschitz continuity of the gradient. arXiv preprint arXiv:2301.04431 , 2023

  26. [26]

    Li and G

    T. Li and G. Lan. A simple uniformly optimal method without line search for convex optimization. arXiv preprint arXiv:2310.10082, 2023

  27. [27]

    Liang and R

    J. Liang and R. D. Monteiro. An average curvature accelerated composite gradient method for nonconvex smooth composite optimization problems. SIAM Journal on Optimization , 31(1):217–243, 2021

  28. [28]

    Liang and R

    J. Liang and R. D. Monteiro. Average curvature fista for nonconvex smooth composite optimization problems. Com- putational Optimization and Applications , 86(1):275–302, 2023

  29. [29]

    B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 806–814, 2015

  30. [30]

    Mairal, F

    J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning , pages 689–696, 2009

  31. [31]

    Malitsky and K

    Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. arXiv preprint arXiv:1910.09529 , 2019

  32. [32]

    Mancino-Ball, S

    G. Mancino-Ball, S. Miao, Y. Xu, and J. Chen. Proximal stochastic recursive momentum methods for nonconvex composite decentralized optimization. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 37, pages 9055–9063, 2023

  33. [33]

    Mason, J

    L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent.Advances in neural information processing systems, 12, 1999

  34. [34]

    Nesterov

    Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). In Doklady an ussr , volume 269, pages 543–547, 1983

  35. [35]

    Y. E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course . Kluwer Academic Publishers, Mas- sachusetts, 2004

  36. [36]

    Orabona and D

    F. Orabona and D. P´ al. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016

  37. [37]

    M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming . John Wiley & Sons, 2014

  38. [38]

    R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization , 14(5):877–898, 1976

  39. [39]

    J. V. Shi, Y. Xu, and R. G. Baraniuk. Sparse bilinear logistic regression. arXiv preprint arXiv:1404.4104 , 2014

  40. [40]

    R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018

  41. [41]

    Tran-Dinh, N

    Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen. A hybrid stochastic optimization framework for composite nonconvex optimization. Mathematical Programming, 191(2):1005–1071, 2022

  42. [42]

    Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems , 32, 2019

  43. [43]

    Xu and Y

    Y. Xu and Y. Xu. Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization. Journal of Optimization Theory and Applications , 196(1):266–297, 2023

  44. [44]

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223 , 2023

  45. [45]

    D. Zhou, S. Ma, and J. Yang. Adabb: Adaptive barzilai-borwein method for convex optimization. arXiv preprint arXiv:2401.08024, 2024