pith. sign in

arxiv: 2509.11043 · v3 · submitted 2025-09-14 · 🧮 math.OC

A Proximal Stochastic Gradient Method with Adaptive Step Size and Variance Reduction for Convex Composite Optimization

Pith reviewed 2026-05-18 17:15 UTC · model grok-4.3

classification 🧮 math.OC
keywords proximal stochastic gradientvariance reductionadaptive step sizeconvex composite optimizationstrong convergenceO(sqrt(1/k)) ratelogistic regressionlasso regression
0
0 comments X

The pith

A proximal stochastic gradient algorithm with variance reduction and adaptive steps converges strongly at an O(sqrt(1/k)) rate for convex composite problems under Lipschitz continuity of the smooth part.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the PSGA method for composite optimization problems consisting of one smooth convex function and one non-smooth convex function. By combining proximal updates with variance-reduced stochastic gradients and an adaptive step-size rule, the method achieves strong convergence of the iterates when the smooth function is Lipschitz continuous. The authors further show that the expected error between the estimated and true gradient vanishes and that the method attains an O(sqrt(1/k)) convergence rate, with supporting numerical results on logistic regression and Lasso regression.

Core claim

The PSGA algorithm establishes strong convergence of the generated sequence together with an O(sqrt(1/k)) rate on the expected error, provided the smooth convex component is Lipschitz continuous, while also proving that the variance-reduced gradient estimate converges to the true gradient in expectation.

What carries the argument

The PSGA iteration, which performs a proximal step on the non-smooth part after taking a variance-reduced stochastic gradient step on the smooth part with an adaptive step-size choice.

If this is right

  • The sequence of iterates converges strongly to a minimizer under the stated Lipschitz condition.
  • The expected difference between the variance-reduced gradient and the true gradient tends to zero.
  • An O(sqrt(1/k)) rate holds for the convergence of the objective values or the distance to optimality.
  • The method is effective on logistic regression and Lasso regression instances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The adaptive step-size rule may allow the method to remain stable on problems where a fixed step size would require careful tuning.
  • The variance-reduction component could be swapped for other estimators such as those in SVRG-style methods while preserving the same convergence guarantees.
  • The framework naturally supports large-scale regularized problems in machine learning where the non-smooth term encodes sparsity or other structure.

Load-bearing premise

The smooth convex function is Lipschitz continuous.

What would settle it

A concrete counterexample in which the smooth function violates Lipschitz continuity and the iterates fail to converge strongly or the observed rate falls below O(sqrt(1/k)).

read the original abstract

In this paper, we propose a proximal stochasitc gradient algorithm (PSGA) for solving composite optimization problems by incorporating variance reduction techniques and an adaptive step-size strategy. In the PSGA method, the objective function consists of two components: one is a smooth convex function, and the other is a non-smooth convex function. We establish the strong convergence of the proposed method, provided that the smooth convex function is Lipschitz continuous. We also prove that the expected value of the error between the estimated gradient and the actual gradient converges to zero. Furthermore, we get an \( O(\sqrt{1/k}) \) convergence rate for our method. Finally, the effectiveness of the proposed method is validated through numerical experiments on Logistic regression and Lasso regression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes a proximal stochastic gradient algorithm (PSGA) for convex composite optimization problems of the form minimize f(x) + g(x), where f is smooth convex and g is non-smooth convex. The method combines variance reduction with an adaptive step-size rule. The authors claim to prove strong convergence of the iterates under the assumption that the smooth convex function is Lipschitz continuous, to show that the expected error of the variance-reduced gradient estimator to the true gradient converges to zero, to derive an O(sqrt(1/k)) convergence rate, and to validate the approach via numerical experiments on logistic regression and Lasso regression.

Significance. If the convergence analysis is correct, the work adds a variant of variance-reduced proximal stochastic gradient methods that incorporates adaptive stepsizes while retaining strong convergence and a standard sublinear rate. This combination may be useful for large-scale machine-learning problems where fixed-step methods are inefficient. The numerical validation on standard regression tasks is appropriate, though the contribution would be strengthened by more extensive baselines and statistical reporting.

minor comments (3)
  1. [Abstract] Abstract: 'proximal stochasitc gradient algorithm' contains a typo and should read 'stochastic'.
  2. [Abstract] Abstract: The assumption is phrased as 'the smooth convex function is Lipschitz continuous'. Standard terminology in this setting is that the gradient of the smooth function is Lipschitz continuous (i.e., f is L-smooth). Please clarify the precise assumption and its placement in the main text.
  3. [Numerical experiments] Numerical experiments section: The description supplies no information on the number of independent runs, error bars, or explicit baseline comparisons (e.g., against SVRG or standard proximal SGD). Adding these details would improve reproducibility and allow readers to assess practical gains.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. The referee summary accurately captures the main elements of our work: the PSGA algorithm combining proximal stochastic gradient steps with variance reduction and adaptive step sizes, the strong convergence result for convex composite problems under the Lipschitz continuity assumption on the smooth term, the proof that the expected error of the variance-reduced estimator converges to zero, the O(sqrt(1/k)) rate, and the numerical validation on logistic and Lasso regression. No specific major comments were enumerated in the report, so we provide no point-by-point rebuttals to major comments. We acknowledge the observation that more extensive baselines and statistical reporting would strengthen the experiments and will address this in revision.

Circularity Check

0 steps flagged

No significant circularity; analysis rests on standard assumptions

full rationale

The paper proposes PSGA for composite convex optimization and proves strong convergence plus O(sqrt(1/k)) rate under the assumption that the smooth component has Lipschitz continuous gradient. The key steps show that the variance-reduced estimator has error expectation converging to zero, which is then used to bound the proximal mapping iterates and obtain almost-sure convergence. These relations follow from convexity, unbiasedness of the stochastic gradients, and standard summability arguments on the adaptive steps; none of the quantities are defined in terms of the final convergence rate or fitted to the outputs they are used to prove. The derivation is therefore self-contained against external benchmarks and does not reduce to self-definition, fitted-input renaming, or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions of convexity and Lipschitz continuity of the smooth term; no free parameters, invented entities, or ad-hoc axioms are visible in the abstract.

axioms (2)
  • domain assumption The smooth component of the objective is convex and Lipschitz continuous.
    Explicitly required for the strong convergence and rate statements in the abstract.
  • domain assumption The non-smooth component is convex (so that the proximal operator is well-defined).
    Implicit in the use of proximal stochastic gradient steps for composite problems.

pith-pipeline@v0.9.0 · 5657 in / 1325 out tokens · 38618 ms · 2026-05-18T17:15:15.649956+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  1. [1]

    Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16(1) (1966), 1-3

    L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16(1) (1966), 1-3

  2. [2]

    Ash, and C.A

    R.B. Ash, and C.A. Doleans-Dade, Probability and Measure Theory, Academic Press, 2000

  3. [3]

    Barzilai, and J.M

    J. Barzilai, and J.M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8(1) (1988), 141-148

  4. [4]

    Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017

    A. Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017

  5. [5]

    Bottou, F.E

    L. Bottou, F.E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60(2) (2018), 223-311

  6. [6]

    S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3(1) (2011), 1-122

  7. [7]

    Burdakov, Y

    O. Burdakov, Y. Dai, and N. Huang, Stabilized Barzilai-Borwein method, Journal of Computa- tional Mathematics, 37(6) (2019), 916-936

  8. [8]

    Candes, and M.B

    E.J. Candes, and M.B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine, 25(2) (2008), 21-30

  9. [9]

    Chang, and C.J

    C.C. Chang, and C.J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2(3) (2011), 1-27

  10. [10]

    Chen, F.E

    T. Chen, F.E. Curtis, and D.P. Robinson, A reduced-space algorithm for minimizing ℓ1- regularized convex functions, SIAM Journal on Optimization, 27(3) (2017), 1583-1610

  11. [11]

    Cutkosky, and F

    A. Cutkosky, and F. Orabona, Momentum-based variance reduction in non-convex SGD, Ad- vances in Neural Information Processing Systems 32, 2019, 15236-15245

  12. [12]

    Y. Dai, G. Wang, F.E. Curtis, and D.P. Robinson, A Variance-Reduced and Stabilized Proximal Stochastic Gradient Method with Support Identification Guarantees for Structured Optimization, In International Conference on Artificial Intelligence and Statistics, 2023, 5107-5133

  13. [13]

    Defazio, F

    A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems 27, 2014, 1646-1654

  14. [14]

    Duchi, E

    J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochas- tic optimization, Journal of Machine Learning Research, 12 (2011), 2121-2159

  15. [15]

    Durrett, Probability: theory and examples

    R. Durrett, Probability: theory and examples. Vol. 49. Cambridge University Press, 2019

  16. [16]

    C. Fang, L. Hu, and S. Chen, An inexact primal-dual method with correction step for a saddle point problem in image debluring, Journal of Global Optimization, 87(2)(2023), 965-988

  17. [17]

    Grimmer, S

    B. Grimmer, S. Kevin, and A.L. Wang, Accelerated gradient descent via long steps, arXiv preprint arXiv:2309.09961 (2023)

  18. [18]

    Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol

    T. Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol. 2, New York: Springer, 2009

  19. [19]

    Huang, and C

    Z.S. Huang, and C. Lee, Training structured neural networks through manifold identification and variance reduction, arXiv preprint arXiv:2112.02612 (2021). 19

  20. [20]

    Johnson, and T

    R. Johnson, and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems 26, 2013, 315-323

  21. [21]

    Kolmogorov, Foundations of the theory of probability: Second English Edition, Courier Dover Publications, 2018

    A.N. Kolmogorov, Foundations of the theory of probability: Second English Edition, Courier Dover Publications, 2018

  22. [22]

    G. Lan, Y. Ouyang, and Z. Zhang, Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization, arXiv preprint arXiv:2310.12139 (2023)

  23. [23]

    Latafat, A

    P. Latafat, A. Themelis, L. Stella, and P. Patrinos, Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient, Mathematical Programming, 2024. https://doi.org/10.1007/s10107-024-02143-7

  24. [24]

    Latafat, A

    P. Latafat, A. Themelis, and P. Patrinos, On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms, 6th Annual Learning for Dynamics and Control Conference, 2024, 197-208

  25. [25]

    Li, and G

    T. Li, and G. Lan, A simple uniformly optimal method without line search for convex optimiza- tion, arXiv preprint arXiv:2310.10082 (2023)

  26. [26]

    Lin, Probability inequalities, Springer, 2010

    Z. Lin, Probability inequalities, Springer, 2010

  27. [27]

    Liu, T.D

    Z. Liu, T.D. Nguyen, T.H. Nguyen, A. Ene, and H.L. Nguyen, META-STORM: Generalized fully-adaptive variance reduced SGD for unbounded functions, arXiv preprint arXiv:2209.14853 (2022)

  28. [28]

    Malitsky, and K

    Y. Malitsky, and K. Mishchenko, Adaptive gradient descent without descent, arXiv preprint arXiv:1910.09529 (2019)

  29. [29]

    Malitsky, and K

    Y. Malitsky, and K. Mishchenko, Adaptive proximal gradient method for convex optimization, Advances in Neural Information Processing Systems 37, 2024, 100670-100697

  30. [30]

    Milzarek, F

    A. Milzarek, F. Schaipp, and M. Ulbrich, A semismooth Newton stochastic proximal point algorithm with variance reduction, SIAM Journal on Optimization, 34(1) (2024), 1157-1185

  31. [31]

    S. Na, M. Derezinski, and M.W. Mahoney, Hessian averaging in stochastic Newton methods achieves superlinear convergence, Mathematical Programming, 201(1) (2023), 473-520

  32. [32]

    Neri, A finitary Kronecker’s lemma and large deviations in the strong law of large numbers on Banach spaces, Annals of Pure and Applied Logic, 176(6) (2025), 103569

    M. Neri, A finitary Kronecker’s lemma and large deviations in the strong law of large numbers on Banach spaces, Annals of Pure and Applied Logic, 176(6) (2025), 103569

  33. [33]

    Nesterov, Lectures on convex optimization, Vol

    Y. Nesterov, Lectures on convex optimization, Vol. 137, Berlin: Springer International Publish- ing, 2018

  34. [34]

    Nguyen, J

    L.M. Nguyen, J. Liu, K. Scheinberg, and M. Takac, SARAH: A novel method for machine learn- ing problems using stochastic recursive gradient, International Conference on Machine Learning, 2017, 2613-2621

  35. [35]

    Pham, L.M

    N.H. Pham, L.M. Nguyen, D.T. Phan, and Q. Tran-Dinh, ProxSARAH: An efficient algorith- mic framework for stochastic composite nonconvex optimization, Journal of Machine Learning Research, 21(110) (2020), 1-48

  36. [36]

    D.N. Phan, S. Bartz, N. Guha, and H.M. Phan, Stochastic Variance-Reduced Majorization- Minimization Algorithms, SIAM Journal on Mathematics of Data Science, 6(4) (2024), 926-952

  37. [37]

    Phan, and N

    D.N. Phan, and N. Gillis, An inertial block majorization minimization framework for nonsmooth nonconvex optimization, Journal of Machine Learning Research, 24(18) (2023), 1-41. 20

  38. [38]

    Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained mini- mization problem, SIAM Journal on Optimization, 7(1) (1997), 26-33

    M. Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained mini- mization problem, SIAM Journal on Optimization, 7(1) (1997), 26-33

  39. [39]

    Robbins, and S

    H. Robbins, and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 1951, 400-407

  40. [40]

    Robbins, and D

    H. Robbins, and D. Siegmund, A convergence theorem for non negative almost supermartingales and some applications, Optimizing Methods in Statistics, 1971, 233-257

  41. [41]

    Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14(5) (1976), 877-898

    R.T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14(5) (1976), 877-898

  42. [42]

    C. Tan, S. Ma, Y.H. Dai, and Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, Advances in Neural Information Processing Systems 29, 2016, 685-693

  43. [43]

    Tibshirani, Regression shrinkage and selection via the lasso, Journal of The Royal Statistical Society Series B: Statistical Methodology, 58(1) (1996), 267-288

    R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of The Royal Statistical Society Series B: Statistical Methodology, 58(1) (1996), 267-288

  44. [44]

    Tran-Dinh, N.H

    Q. Tran-Dinh, N.H. Pham, D.T. Phan, and L.M. Nguyen, A hybrid stochastic optimization framework for composite nonconvex optimization, Mathematical Programming, 191(2) (2022), 1005-1071

  45. [45]

    Traore, V

    C. Traore, V. Apidopoulos, S. Salzo, and S. Villa, Variance reduction techniques for stochastic proximal point algorithms, Journal of Optimization Theory and Applications, 203(2) (2024), 1910-1939

  46. [46]

    Xiao, Dual averaging method for regularized stochastic learning and online optimization, The Journal of Machine Learning Research, 11 (2010), 2543-2596

    L. Xiao, Dual averaging method for regularized stochastic learning and online optimization, The Journal of Machine Learning Research, 11 (2010), 2543-2596

  47. [47]

    Xiao, and T

    L. Xiao, and T. Zhang, A proximal stochastic gradient method with progressive variance reduc- tion, SIAM Journal on Optimization, 24(4) (2014), 2057-2075

  48. [48]

    Xu, and Y

    Y. Xu, and Y. Xu, Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization, Journal of Optimization Theory and Applications, 196(1) (2023), 266-297

  49. [49]

    Yang, and H

    Y. Yang, and H. Zou, A fast unified algorithm for solving group-lasso penalize learning problems, Statistics and Computing, 25(6) (2015), 1129-1141

  50. [50]

    D. Zhou, S. Ma, and J. Yang, AdaBB: Adaptive Barzilai-Borwein Method for Convex Optimiza- tion, arXiv preprint arXiv:2401.08024 (2024)

  51. [51]

    Zou, and T

    H. Zou, and T. Hastie, Regularization and variable selection via the elastic net, Journal of The Royal Statistical Society Series B: Statistical Methodology, 67(2) (2005), 301-320. 21