A Proximal Stochastic Gradient Method with Adaptive Step Size and Variance Reduction for Convex Composite Optimization
Pith reviewed 2026-05-18 17:15 UTC · model grok-4.3
The pith
A proximal stochastic gradient algorithm with variance reduction and adaptive steps converges strongly at an O(sqrt(1/k)) rate for convex composite problems under Lipschitz continuity of the smooth part.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The PSGA algorithm establishes strong convergence of the generated sequence together with an O(sqrt(1/k)) rate on the expected error, provided the smooth convex component is Lipschitz continuous, while also proving that the variance-reduced gradient estimate converges to the true gradient in expectation.
What carries the argument
The PSGA iteration, which performs a proximal step on the non-smooth part after taking a variance-reduced stochastic gradient step on the smooth part with an adaptive step-size choice.
If this is right
- The sequence of iterates converges strongly to a minimizer under the stated Lipschitz condition.
- The expected difference between the variance-reduced gradient and the true gradient tends to zero.
- An O(sqrt(1/k)) rate holds for the convergence of the objective values or the distance to optimality.
- The method is effective on logistic regression and Lasso regression instances.
Where Pith is reading between the lines
- The adaptive step-size rule may allow the method to remain stable on problems where a fixed step size would require careful tuning.
- The variance-reduction component could be swapped for other estimators such as those in SVRG-style methods while preserving the same convergence guarantees.
- The framework naturally supports large-scale regularized problems in machine learning where the non-smooth term encodes sparsity or other structure.
Load-bearing premise
The smooth convex function is Lipschitz continuous.
What would settle it
A concrete counterexample in which the smooth function violates Lipschitz continuity and the iterates fail to converge strongly or the observed rate falls below O(sqrt(1/k)).
read the original abstract
In this paper, we propose a proximal stochasitc gradient algorithm (PSGA) for solving composite optimization problems by incorporating variance reduction techniques and an adaptive step-size strategy. In the PSGA method, the objective function consists of two components: one is a smooth convex function, and the other is a non-smooth convex function. We establish the strong convergence of the proposed method, provided that the smooth convex function is Lipschitz continuous. We also prove that the expected value of the error between the estimated gradient and the actual gradient converges to zero. Furthermore, we get an \( O(\sqrt{1/k}) \) convergence rate for our method. Finally, the effectiveness of the proposed method is validated through numerical experiments on Logistic regression and Lasso regression.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a proximal stochastic gradient algorithm (PSGA) for convex composite optimization problems of the form minimize f(x) + g(x), where f is smooth convex and g is non-smooth convex. The method combines variance reduction with an adaptive step-size rule. The authors claim to prove strong convergence of the iterates under the assumption that the smooth convex function is Lipschitz continuous, to show that the expected error of the variance-reduced gradient estimator to the true gradient converges to zero, to derive an O(sqrt(1/k)) convergence rate, and to validate the approach via numerical experiments on logistic regression and Lasso regression.
Significance. If the convergence analysis is correct, the work adds a variant of variance-reduced proximal stochastic gradient methods that incorporates adaptive stepsizes while retaining strong convergence and a standard sublinear rate. This combination may be useful for large-scale machine-learning problems where fixed-step methods are inefficient. The numerical validation on standard regression tasks is appropriate, though the contribution would be strengthened by more extensive baselines and statistical reporting.
minor comments (3)
- [Abstract] Abstract: 'proximal stochasitc gradient algorithm' contains a typo and should read 'stochastic'.
- [Abstract] Abstract: The assumption is phrased as 'the smooth convex function is Lipschitz continuous'. Standard terminology in this setting is that the gradient of the smooth function is Lipschitz continuous (i.e., f is L-smooth). Please clarify the precise assumption and its placement in the main text.
- [Numerical experiments] Numerical experiments section: The description supplies no information on the number of independent runs, error bars, or explicit baseline comparisons (e.g., against SVRG or standard proximal SGD). Adding these details would improve reproducibility and allow readers to assess practical gains.
Simulated Author's Rebuttal
We thank the referee for the careful review and the recommendation of minor revision. The referee summary accurately captures the main elements of our work: the PSGA algorithm combining proximal stochastic gradient steps with variance reduction and adaptive step sizes, the strong convergence result for convex composite problems under the Lipschitz continuity assumption on the smooth term, the proof that the expected error of the variance-reduced estimator converges to zero, the O(sqrt(1/k)) rate, and the numerical validation on logistic and Lasso regression. No specific major comments were enumerated in the report, so we provide no point-by-point rebuttals to major comments. We acknowledge the observation that more extensive baselines and statistical reporting would strengthen the experiments and will address this in revision.
Circularity Check
No significant circularity; analysis rests on standard assumptions
full rationale
The paper proposes PSGA for composite convex optimization and proves strong convergence plus O(sqrt(1/k)) rate under the assumption that the smooth component has Lipschitz continuous gradient. The key steps show that the variance-reduced estimator has error expectation converging to zero, which is then used to bound the proximal mapping iterates and obtain almost-sure convergence. These relations follow from convexity, unbiasedness of the stochastic gradients, and standard summability arguments on the adaptive steps; none of the quantities are defined in terms of the final convergence rate or fitted to the outputs they are used to prove. The derivation is therefore self-contained against external benchmarks and does not reduce to self-definition, fitted-input renaming, or self-citation chains.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The smooth component of the objective is convex and Lipschitz continuous.
- domain assumption The non-smooth component is convex (so that the proximal operator is well-defined).
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We establish the strong convergence of the proposed method, provided that the smooth convex function is Lipschitz continuous. ... we get an O(sqrt(1/k)) convergence rate
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Step 5. Set step size: If τk ≥ ηk−1, set ηk = (1 + 1/τk) ηk−1 ...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16(1) (1966), 1-3
work page 1966
-
[2]
R.B. Ash, and C.A. Doleans-Dade, Probability and Measure Theory, Academic Press, 2000
work page 2000
-
[3]
J. Barzilai, and J.M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8(1) (1988), 141-148
work page 1988
-
[4]
Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017
A. Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017
work page 2017
-
[5]
L. Bottou, F.E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60(2) (2018), 223-311
work page 2018
-
[6]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3(1) (2011), 1-122
work page 2011
-
[7]
O. Burdakov, Y. Dai, and N. Huang, Stabilized Barzilai-Borwein method, Journal of Computa- tional Mathematics, 37(6) (2019), 916-936
work page 2019
-
[8]
E.J. Candes, and M.B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine, 25(2) (2008), 21-30
work page 2008
-
[9]
C.C. Chang, and C.J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2(3) (2011), 1-27
work page 2011
- [10]
-
[11]
A. Cutkosky, and F. Orabona, Momentum-based variance reduction in non-convex SGD, Ad- vances in Neural Information Processing Systems 32, 2019, 15236-15245
work page 2019
-
[12]
Y. Dai, G. Wang, F.E. Curtis, and D.P. Robinson, A Variance-Reduced and Stabilized Proximal Stochastic Gradient Method with Support Identification Guarantees for Structured Optimization, In International Conference on Artificial Intelligence and Statistics, 2023, 5107-5133
work page 2023
-
[13]
A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems 27, 2014, 1646-1654
work page 2014
- [14]
-
[15]
Durrett, Probability: theory and examples
R. Durrett, Probability: theory and examples. Vol. 49. Cambridge University Press, 2019
work page 2019
-
[16]
C. Fang, L. Hu, and S. Chen, An inexact primal-dual method with correction step for a saddle point problem in image debluring, Journal of Global Optimization, 87(2)(2023), 965-988
work page 2023
-
[17]
B. Grimmer, S. Kevin, and A.L. Wang, Accelerated gradient descent via long steps, arXiv preprint arXiv:2309.09961 (2023)
-
[18]
Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol
T. Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol. 2, New York: Springer, 2009
work page 2009
-
[19]
Z.S. Huang, and C. Lee, Training structured neural networks through manifold identification and variance reduction, arXiv preprint arXiv:2112.02612 (2021). 19
-
[20]
R. Johnson, and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems 26, 2013, 315-323
work page 2013
-
[21]
A.N. Kolmogorov, Foundations of the theory of probability: Second English Edition, Courier Dover Publications, 2018
work page 2018
- [22]
-
[23]
P. Latafat, A. Themelis, L. Stella, and P. Patrinos, Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient, Mathematical Programming, 2024. https://doi.org/10.1007/s10107-024-02143-7
-
[24]
P. Latafat, A. Themelis, and P. Patrinos, On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms, 6th Annual Learning for Dynamics and Control Conference, 2024, 197-208
work page 2024
- [25]
-
[26]
Lin, Probability inequalities, Springer, 2010
Z. Lin, Probability inequalities, Springer, 2010
work page 2010
- [27]
-
[28]
Y. Malitsky, and K. Mishchenko, Adaptive gradient descent without descent, arXiv preprint arXiv:1910.09529 (2019)
-
[29]
Y. Malitsky, and K. Mishchenko, Adaptive proximal gradient method for convex optimization, Advances in Neural Information Processing Systems 37, 2024, 100670-100697
work page 2024
-
[30]
A. Milzarek, F. Schaipp, and M. Ulbrich, A semismooth Newton stochastic proximal point algorithm with variance reduction, SIAM Journal on Optimization, 34(1) (2024), 1157-1185
work page 2024
-
[31]
S. Na, M. Derezinski, and M.W. Mahoney, Hessian averaging in stochastic Newton methods achieves superlinear convergence, Mathematical Programming, 201(1) (2023), 473-520
work page 2023
-
[32]
M. Neri, A finitary Kronecker’s lemma and large deviations in the strong law of large numbers on Banach spaces, Annals of Pure and Applied Logic, 176(6) (2025), 103569
work page 2025
-
[33]
Nesterov, Lectures on convex optimization, Vol
Y. Nesterov, Lectures on convex optimization, Vol. 137, Berlin: Springer International Publish- ing, 2018
work page 2018
- [34]
- [35]
-
[36]
D.N. Phan, S. Bartz, N. Guha, and H.M. Phan, Stochastic Variance-Reduced Majorization- Minimization Algorithms, SIAM Journal on Mathematics of Data Science, 6(4) (2024), 926-952
work page 2024
-
[37]
D.N. Phan, and N. Gillis, An inertial block majorization minimization framework for nonsmooth nonconvex optimization, Journal of Machine Learning Research, 24(18) (2023), 1-41. 20
work page 2023
-
[38]
M. Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained mini- mization problem, SIAM Journal on Optimization, 7(1) (1997), 26-33
work page 1997
-
[39]
H. Robbins, and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 1951, 400-407
work page 1951
-
[40]
H. Robbins, and D. Siegmund, A convergence theorem for non negative almost supermartingales and some applications, Optimizing Methods in Statistics, 1971, 233-257
work page 1971
-
[41]
R.T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14(5) (1976), 877-898
work page 1976
-
[42]
C. Tan, S. Ma, Y.H. Dai, and Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, Advances in Neural Information Processing Systems 29, 2016, 685-693
work page 2016
-
[43]
R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of The Royal Statistical Society Series B: Statistical Methodology, 58(1) (1996), 267-288
work page 1996
-
[44]
Q. Tran-Dinh, N.H. Pham, D.T. Phan, and L.M. Nguyen, A hybrid stochastic optimization framework for composite nonconvex optimization, Mathematical Programming, 191(2) (2022), 1005-1071
work page 2022
- [45]
-
[46]
L. Xiao, Dual averaging method for regularized stochastic learning and online optimization, The Journal of Machine Learning Research, 11 (2010), 2543-2596
work page 2010
-
[47]
L. Xiao, and T. Zhang, A proximal stochastic gradient method with progressive variance reduc- tion, SIAM Journal on Optimization, 24(4) (2014), 2057-2075
work page 2014
- [48]
-
[49]
Y. Yang, and H. Zou, A fast unified algorithm for solving group-lasso penalize learning problems, Statistics and Computing, 25(6) (2015), 1129-1141
work page 2015
- [50]
-
[51]
H. Zou, and T. Hastie, Regularization and variable selection via the elastic net, Journal of The Royal Statistical Society Series B: Statistical Methodology, 67(2) (2005), 301-320. 21
work page 2005
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.