A Proximal Stochastic Gradient Method with Adaptive Step Size and Variance Reduction for Convex Composite Optimization

Changjie Fang; Hao Yang; Shenglan Chen

arxiv: 2509.11043 · v3 · submitted 2025-09-14 · 🧮 math.OC

A Proximal Stochastic Gradient Method with Adaptive Step Size and Variance Reduction for Convex Composite Optimization

Changjie Fang , Hao Yang , Shenglan Chen This is my paper

Pith reviewed 2026-05-18 17:15 UTC · model grok-4.3

classification 🧮 math.OC

keywords proximal stochastic gradientvariance reductionadaptive step sizeconvex composite optimizationstrong convergenceO(sqrt(1/k)) ratelogistic regressionlasso regression

0 comments

The pith

A proximal stochastic gradient algorithm with variance reduction and adaptive steps converges strongly at an O(sqrt(1/k)) rate for convex composite problems under Lipschitz continuity of the smooth part.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the PSGA method for composite optimization problems consisting of one smooth convex function and one non-smooth convex function. By combining proximal updates with variance-reduced stochastic gradients and an adaptive step-size rule, the method achieves strong convergence of the iterates when the smooth function is Lipschitz continuous. The authors further show that the expected error between the estimated and true gradient vanishes and that the method attains an O(sqrt(1/k)) convergence rate, with supporting numerical results on logistic regression and Lasso regression.

Core claim

The PSGA algorithm establishes strong convergence of the generated sequence together with an O(sqrt(1/k)) rate on the expected error, provided the smooth convex component is Lipschitz continuous, while also proving that the variance-reduced gradient estimate converges to the true gradient in expectation.

What carries the argument

The PSGA iteration, which performs a proximal step on the non-smooth part after taking a variance-reduced stochastic gradient step on the smooth part with an adaptive step-size choice.

If this is right

The sequence of iterates converges strongly to a minimizer under the stated Lipschitz condition.
The expected difference between the variance-reduced gradient and the true gradient tends to zero.
An O(sqrt(1/k)) rate holds for the convergence of the objective values or the distance to optimality.
The method is effective on logistic regression and Lasso regression instances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The adaptive step-size rule may allow the method to remain stable on problems where a fixed step size would require careful tuning.
The variance-reduction component could be swapped for other estimators such as those in SVRG-style methods while preserving the same convergence guarantees.
The framework naturally supports large-scale regularized problems in machine learning where the non-smooth term encodes sparsity or other structure.

Load-bearing premise

The smooth convex function is Lipschitz continuous.

What would settle it

A concrete counterexample in which the smooth function violates Lipschitz continuity and the iterates fail to converge strongly or the observed rate falls below O(sqrt(1/k)).

read the original abstract

In this paper, we propose a proximal stochasitc gradient algorithm (PSGA) for solving composite optimization problems by incorporating variance reduction techniques and an adaptive step-size strategy. In the PSGA method, the objective function consists of two components: one is a smooth convex function, and the other is a non-smooth convex function. We establish the strong convergence of the proposed method, provided that the smooth convex function is Lipschitz continuous. We also prove that the expected value of the error between the estimated gradient and the actual gradient converges to zero. Furthermore, we get an \( O(\sqrt{1/k}) \) convergence rate for our method. Finally, the effectiveness of the proposed method is validated through numerical experiments on Logistic regression and Lasso regression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper adds an adaptive step size to a variance-reduced proximal stochastic gradient method for convex composite problems and proves the standard strong convergence plus O(sqrt(1/k)) rate under Lipschitz continuity.

read the letter

The key takeaway is that this paper gives a proximal stochastic gradient method with variance reduction and adaptive steps for convex composite problems, and it proves strong convergence plus an O(sqrt(1/k)) rate under Lipschitz continuity of the smooth part. The work combines the proximal operator for the nonsmooth convex term with a variance-reduced gradient estimator and an adaptive step-size rule. They show that the bias or error in the estimated gradient vanishes in expectation, which lets them bound the distance to the optimum and get almost-sure convergence. The stress-test note indicates the logical steps hold without obvious holes like missing summability or mishandled proximal terms. On the positive side, the paper includes numerical tests on logistic regression and lasso problems. These are standard benchmarks, so they give a sense of practical behavior. The main limitation is that the result is incremental. Many papers already handle proximal composite optimization with variance reduction, and the rate here is the typical one for non-strongly convex cases. The adaptive step size is presented as a feature, but without a direct comparison to fixed-step versions or other adaptive schemes, it's hard to gauge how much it helps. The abstract mentions strong convergence, which is good, but the experiments lack details on variance or multiple runs that would make the validation more robust. Overall, the paper is for researchers in stochastic optimization who follow developments in proximal and variance-reduced methods. Someone working on similar algorithms would get value from seeing this particular combination and the proof details. It is coherent enough on its own terms to merit a full review. I would send this to peer review. The claims are checkable, the assumptions are standard, and the contribution, while not revolutionary, adds a concrete algorithm that could be useful.

Referee Report

0 major / 3 minor

Summary. The paper proposes a proximal stochastic gradient algorithm (PSGA) for convex composite optimization problems of the form minimize f(x) + g(x), where f is smooth convex and g is non-smooth convex. The method combines variance reduction with an adaptive step-size rule. The authors claim to prove strong convergence of the iterates under the assumption that the smooth convex function is Lipschitz continuous, to show that the expected error of the variance-reduced gradient estimator to the true gradient converges to zero, to derive an O(sqrt(1/k)) convergence rate, and to validate the approach via numerical experiments on logistic regression and Lasso regression.

Significance. If the convergence analysis is correct, the work adds a variant of variance-reduced proximal stochastic gradient methods that incorporates adaptive stepsizes while retaining strong convergence and a standard sublinear rate. This combination may be useful for large-scale machine-learning problems where fixed-step methods are inefficient. The numerical validation on standard regression tasks is appropriate, though the contribution would be strengthened by more extensive baselines and statistical reporting.

minor comments (3)

[Abstract] Abstract: 'proximal stochasitc gradient algorithm' contains a typo and should read 'stochastic'.
[Abstract] Abstract: The assumption is phrased as 'the smooth convex function is Lipschitz continuous'. Standard terminology in this setting is that the gradient of the smooth function is Lipschitz continuous (i.e., f is L-smooth). Please clarify the precise assumption and its placement in the main text.
[Numerical experiments] Numerical experiments section: The description supplies no information on the number of independent runs, error bars, or explicit baseline comparisons (e.g., against SVRG or standard proximal SGD). Adding these details would improve reproducibility and allow readers to assess practical gains.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. The referee summary accurately captures the main elements of our work: the PSGA algorithm combining proximal stochastic gradient steps with variance reduction and adaptive step sizes, the strong convergence result for convex composite problems under the Lipschitz continuity assumption on the smooth term, the proof that the expected error of the variance-reduced estimator converges to zero, the O(sqrt(1/k)) rate, and the numerical validation on logistic and Lasso regression. No specific major comments were enumerated in the report, so we provide no point-by-point rebuttals to major comments. We acknowledge the observation that more extensive baselines and statistical reporting would strengthen the experiments and will address this in revision.

Circularity Check

0 steps flagged

No significant circularity; analysis rests on standard assumptions

full rationale

The paper proposes PSGA for composite convex optimization and proves strong convergence plus O(sqrt(1/k)) rate under the assumption that the smooth component has Lipschitz continuous gradient. The key steps show that the variance-reduced estimator has error expectation converging to zero, which is then used to bound the proximal mapping iterates and obtain almost-sure convergence. These relations follow from convexity, unbiasedness of the stochastic gradients, and standard summability arguments on the adaptive steps; none of the quantities are defined in terms of the final convergence rate or fitted to the outputs they are used to prove. The derivation is therefore self-contained against external benchmarks and does not reduce to self-definition, fitted-input renaming, or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions of convexity and Lipschitz continuity of the smooth term; no free parameters, invented entities, or ad-hoc axioms are visible in the abstract.

axioms (2)

domain assumption The smooth component of the objective is convex and Lipschitz continuous.
Explicitly required for the strong convergence and rate statements in the abstract.
domain assumption The non-smooth component is convex (so that the proximal operator is well-defined).
Implicit in the use of proximal stochastic gradient steps for composite problems.

pith-pipeline@v0.9.0 · 5657 in / 1325 out tokens · 38618 ms · 2026-05-18T17:15:15.649956+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We establish the strong convergence of the proposed method, provided that the smooth convex function is Lipschitz continuous. ... we get an O(sqrt(1/k)) convergence rate
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Step 5. Set step size: If τk ≥ ηk−1, set ηk = (1 + 1/τk) ηk−1 ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

[1]

Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16(1) (1966), 1-3

L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16(1) (1966), 1-3

work page 1966
[2]

Ash, and C.A

R.B. Ash, and C.A. Doleans-Dade, Probability and Measure Theory, Academic Press, 2000

work page 2000
[3]

Barzilai, and J.M

J. Barzilai, and J.M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8(1) (1988), 141-148

work page 1988
[4]

Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017

A. Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017

work page 2017
[5]

Bottou, F.E

L. Bottou, F.E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60(2) (2018), 223-311

work page 2018
[6]

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3(1) (2011), 1-122

work page 2011
[7]

Burdakov, Y

O. Burdakov, Y. Dai, and N. Huang, Stabilized Barzilai-Borwein method, Journal of Computa- tional Mathematics, 37(6) (2019), 916-936

work page 2019
[8]

Candes, and M.B

E.J. Candes, and M.B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine, 25(2) (2008), 21-30

work page 2008
[9]

Chang, and C.J

C.C. Chang, and C.J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2(3) (2011), 1-27

work page 2011
[10]

Chen, F.E

T. Chen, F.E. Curtis, and D.P. Robinson, A reduced-space algorithm for minimizing ℓ1- regularized convex functions, SIAM Journal on Optimization, 27(3) (2017), 1583-1610

work page 2017
[11]

Cutkosky, and F

A. Cutkosky, and F. Orabona, Momentum-based variance reduction in non-convex SGD, Ad- vances in Neural Information Processing Systems 32, 2019, 15236-15245

work page 2019
[12]

Y. Dai, G. Wang, F.E. Curtis, and D.P. Robinson, A Variance-Reduced and Stabilized Proximal Stochastic Gradient Method with Support Identification Guarantees for Structured Optimization, In International Conference on Artificial Intelligence and Statistics, 2023, 5107-5133

work page 2023
[13]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems 27, 2014, 1646-1654

work page 2014
[14]

Duchi, E

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochas- tic optimization, Journal of Machine Learning Research, 12 (2011), 2121-2159

work page 2011
[15]

Durrett, Probability: theory and examples

R. Durrett, Probability: theory and examples. Vol. 49. Cambridge University Press, 2019

work page 2019
[16]

C. Fang, L. Hu, and S. Chen, An inexact primal-dual method with correction step for a saddle point problem in image debluring, Journal of Global Optimization, 87(2)(2023), 965-988

work page 2023
[17]

Grimmer, S

B. Grimmer, S. Kevin, and A.L. Wang, Accelerated gradient descent via long steps, arXiv preprint arXiv:2309.09961 (2023)

work page arXiv 2023
[18]

Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol

T. Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol. 2, New York: Springer, 2009

work page 2009
[19]

Huang, and C

Z.S. Huang, and C. Lee, Training structured neural networks through manifold identification and variance reduction, arXiv preprint arXiv:2112.02612 (2021). 19

work page arXiv 2021
[20]

Johnson, and T

R. Johnson, and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems 26, 2013, 315-323

work page 2013
[21]

Kolmogorov, Foundations of the theory of probability: Second English Edition, Courier Dover Publications, 2018

A.N. Kolmogorov, Foundations of the theory of probability: Second English Edition, Courier Dover Publications, 2018

work page 2018
[22]

G. Lan, Y. Ouyang, and Z. Zhang, Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization, arXiv preprint arXiv:2310.12139 (2023)

work page arXiv 2023
[23]

Latafat, A

P. Latafat, A. Themelis, L. Stella, and P. Patrinos, Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient, Mathematical Programming, 2024. https://doi.org/10.1007/s10107-024-02143-7

work page doi:10.1007/s10107-024-02143-7 2024
[24]

Latafat, A

P. Latafat, A. Themelis, and P. Patrinos, On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms, 6th Annual Learning for Dynamics and Control Conference, 2024, 197-208

work page 2024
[25]

Li, and G

T. Li, and G. Lan, A simple uniformly optimal method without line search for convex optimiza- tion, arXiv preprint arXiv:2310.10082 (2023)

work page arXiv 2023
[26]

Lin, Probability inequalities, Springer, 2010

Z. Lin, Probability inequalities, Springer, 2010

work page 2010
[27]

Liu, T.D

Z. Liu, T.D. Nguyen, T.H. Nguyen, A. Ene, and H.L. Nguyen, META-STORM: Generalized fully-adaptive variance reduced SGD for unbounded functions, arXiv preprint arXiv:2209.14853 (2022)

work page arXiv 2022
[28]

Malitsky, and K

Y. Malitsky, and K. Mishchenko, Adaptive gradient descent without descent, arXiv preprint arXiv:1910.09529 (2019)

work page arXiv 1910
[29]

Malitsky, and K

Y. Malitsky, and K. Mishchenko, Adaptive proximal gradient method for convex optimization, Advances in Neural Information Processing Systems 37, 2024, 100670-100697

work page 2024
[30]

Milzarek, F

A. Milzarek, F. Schaipp, and M. Ulbrich, A semismooth Newton stochastic proximal point algorithm with variance reduction, SIAM Journal on Optimization, 34(1) (2024), 1157-1185

work page 2024
[31]

S. Na, M. Derezinski, and M.W. Mahoney, Hessian averaging in stochastic Newton methods achieves superlinear convergence, Mathematical Programming, 201(1) (2023), 473-520

work page 2023
[32]

Neri, A finitary Kronecker’s lemma and large deviations in the strong law of large numbers on Banach spaces, Annals of Pure and Applied Logic, 176(6) (2025), 103569

M. Neri, A finitary Kronecker’s lemma and large deviations in the strong law of large numbers on Banach spaces, Annals of Pure and Applied Logic, 176(6) (2025), 103569

work page 2025
[33]

Nesterov, Lectures on convex optimization, Vol

Y. Nesterov, Lectures on convex optimization, Vol. 137, Berlin: Springer International Publish- ing, 2018

work page 2018
[34]

Nguyen, J

L.M. Nguyen, J. Liu, K. Scheinberg, and M. Takac, SARAH: A novel method for machine learn- ing problems using stochastic recursive gradient, International Conference on Machine Learning, 2017, 2613-2621

work page 2017
[35]

Pham, L.M

N.H. Pham, L.M. Nguyen, D.T. Phan, and Q. Tran-Dinh, ProxSARAH: An efficient algorith- mic framework for stochastic composite nonconvex optimization, Journal of Machine Learning Research, 21(110) (2020), 1-48

work page 2020
[36]

D.N. Phan, S. Bartz, N. Guha, and H.M. Phan, Stochastic Variance-Reduced Majorization- Minimization Algorithms, SIAM Journal on Mathematics of Data Science, 6(4) (2024), 926-952

work page 2024
[37]

Phan, and N

D.N. Phan, and N. Gillis, An inertial block majorization minimization framework for nonsmooth nonconvex optimization, Journal of Machine Learning Research, 24(18) (2023), 1-41. 20

work page 2023
[38]

Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained mini- mization problem, SIAM Journal on Optimization, 7(1) (1997), 26-33

M. Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained mini- mization problem, SIAM Journal on Optimization, 7(1) (1997), 26-33

work page 1997
[39]

Robbins, and S

H. Robbins, and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 1951, 400-407

work page 1951
[40]

Robbins, and D

H. Robbins, and D. Siegmund, A convergence theorem for non negative almost supermartingales and some applications, Optimizing Methods in Statistics, 1971, 233-257

work page 1971
[41]

Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14(5) (1976), 877-898

R.T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14(5) (1976), 877-898

work page 1976
[42]

C. Tan, S. Ma, Y.H. Dai, and Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, Advances in Neural Information Processing Systems 29, 2016, 685-693

work page 2016
[43]

Tibshirani, Regression shrinkage and selection via the lasso, Journal of The Royal Statistical Society Series B: Statistical Methodology, 58(1) (1996), 267-288

R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of The Royal Statistical Society Series B: Statistical Methodology, 58(1) (1996), 267-288

work page 1996
[44]

Tran-Dinh, N.H

Q. Tran-Dinh, N.H. Pham, D.T. Phan, and L.M. Nguyen, A hybrid stochastic optimization framework for composite nonconvex optimization, Mathematical Programming, 191(2) (2022), 1005-1071

work page 2022
[45]

Traore, V

C. Traore, V. Apidopoulos, S. Salzo, and S. Villa, Variance reduction techniques for stochastic proximal point algorithms, Journal of Optimization Theory and Applications, 203(2) (2024), 1910-1939

work page 2024
[46]

Xiao, Dual averaging method for regularized stochastic learning and online optimization, The Journal of Machine Learning Research, 11 (2010), 2543-2596

L. Xiao, Dual averaging method for regularized stochastic learning and online optimization, The Journal of Machine Learning Research, 11 (2010), 2543-2596

work page 2010
[47]

Xiao, and T

L. Xiao, and T. Zhang, A proximal stochastic gradient method with progressive variance reduc- tion, SIAM Journal on Optimization, 24(4) (2014), 2057-2075

work page 2014
[48]

Xu, and Y

Y. Xu, and Y. Xu, Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization, Journal of Optimization Theory and Applications, 196(1) (2023), 266-297

work page 2023
[49]

Yang, and H

Y. Yang, and H. Zou, A fast unified algorithm for solving group-lasso penalize learning problems, Statistics and Computing, 25(6) (2015), 1129-1141

work page 2015
[50]

D. Zhou, S. Ma, and J. Yang, AdaBB: Adaptive Barzilai-Borwein Method for Convex Optimiza- tion, arXiv preprint arXiv:2401.08024 (2024)

work page arXiv 2024
[51]

Zou, and T

H. Zou, and T. Hastie, Regularization and variable selection via the elastic net, Journal of The Royal Statistical Society Series B: Statistical Methodology, 67(2) (2005), 301-320. 21

work page 2005

[1] [1]

Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16(1) (1966), 1-3

L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16(1) (1966), 1-3

work page 1966

[2] [2]

Ash, and C.A

R.B. Ash, and C.A. Doleans-Dade, Probability and Measure Theory, Academic Press, 2000

work page 2000

[3] [3]

Barzilai, and J.M

J. Barzilai, and J.M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8(1) (1988), 141-148

work page 1988

[4] [4]

Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017

A. Beck, First-order methods in optimization, Society for Industrial and Applied Mathematics, 2017

work page 2017

[5] [5]

Bottou, F.E

L. Bottou, F.E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60(2) (2018), 223-311

work page 2018

[6] [6]

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3(1) (2011), 1-122

work page 2011

[7] [7]

Burdakov, Y

O. Burdakov, Y. Dai, and N. Huang, Stabilized Barzilai-Borwein method, Journal of Computa- tional Mathematics, 37(6) (2019), 916-936

work page 2019

[8] [8]

Candes, and M.B

E.J. Candes, and M.B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine, 25(2) (2008), 21-30

work page 2008

[9] [9]

Chang, and C.J

C.C. Chang, and C.J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2(3) (2011), 1-27

work page 2011

[10] [10]

Chen, F.E

T. Chen, F.E. Curtis, and D.P. Robinson, A reduced-space algorithm for minimizing ℓ1- regularized convex functions, SIAM Journal on Optimization, 27(3) (2017), 1583-1610

work page 2017

[11] [11]

Cutkosky, and F

A. Cutkosky, and F. Orabona, Momentum-based variance reduction in non-convex SGD, Ad- vances in Neural Information Processing Systems 32, 2019, 15236-15245

work page 2019

[12] [12]

Y. Dai, G. Wang, F.E. Curtis, and D.P. Robinson, A Variance-Reduced and Stabilized Proximal Stochastic Gradient Method with Support Identification Guarantees for Structured Optimization, In International Conference on Artificial Intelligence and Statistics, 2023, 5107-5133

work page 2023

[13] [13]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems 27, 2014, 1646-1654

work page 2014

[14] [14]

Duchi, E

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochas- tic optimization, Journal of Machine Learning Research, 12 (2011), 2121-2159

work page 2011

[15] [15]

Durrett, Probability: theory and examples

R. Durrett, Probability: theory and examples. Vol. 49. Cambridge University Press, 2019

work page 2019

[16] [16]

C. Fang, L. Hu, and S. Chen, An inexact primal-dual method with correction step for a saddle point problem in image debluring, Journal of Global Optimization, 87(2)(2023), 965-988

work page 2023

[17] [17]

Grimmer, S

B. Grimmer, S. Kevin, and A.L. Wang, Accelerated gradient descent via long steps, arXiv preprint arXiv:2309.09961 (2023)

work page arXiv 2023

[18] [18]

Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol

T. Hastie, The elements of statistical learning: data mining, inference, and prediction, Vol. 2, New York: Springer, 2009

work page 2009

[19] [19]

Huang, and C

Z.S. Huang, and C. Lee, Training structured neural networks through manifold identification and variance reduction, arXiv preprint arXiv:2112.02612 (2021). 19

work page arXiv 2021

[20] [20]

Johnson, and T

R. Johnson, and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems 26, 2013, 315-323

work page 2013

[21] [21]

Kolmogorov, Foundations of the theory of probability: Second English Edition, Courier Dover Publications, 2018

A.N. Kolmogorov, Foundations of the theory of probability: Second English Edition, Courier Dover Publications, 2018

work page 2018

[22] [22]

G. Lan, Y. Ouyang, and Z. Zhang, Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization, arXiv preprint arXiv:2310.12139 (2023)

work page arXiv 2023

[23] [23]

Latafat, A

P. Latafat, A. Themelis, L. Stella, and P. Patrinos, Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient, Mathematical Programming, 2024. https://doi.org/10.1007/s10107-024-02143-7

work page doi:10.1007/s10107-024-02143-7 2024

[24] [24]

Latafat, A

P. Latafat, A. Themelis, and P. Patrinos, On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms, 6th Annual Learning for Dynamics and Control Conference, 2024, 197-208

work page 2024

[25] [25]

Li, and G

T. Li, and G. Lan, A simple uniformly optimal method without line search for convex optimiza- tion, arXiv preprint arXiv:2310.10082 (2023)

work page arXiv 2023

[26] [26]

Lin, Probability inequalities, Springer, 2010

Z. Lin, Probability inequalities, Springer, 2010

work page 2010

[27] [27]

Liu, T.D

Z. Liu, T.D. Nguyen, T.H. Nguyen, A. Ene, and H.L. Nguyen, META-STORM: Generalized fully-adaptive variance reduced SGD for unbounded functions, arXiv preprint arXiv:2209.14853 (2022)

work page arXiv 2022

[28] [28]

Malitsky, and K

Y. Malitsky, and K. Mishchenko, Adaptive gradient descent without descent, arXiv preprint arXiv:1910.09529 (2019)

work page arXiv 1910

[29] [29]

Malitsky, and K

Y. Malitsky, and K. Mishchenko, Adaptive proximal gradient method for convex optimization, Advances in Neural Information Processing Systems 37, 2024, 100670-100697

work page 2024

[30] [30]

Milzarek, F

A. Milzarek, F. Schaipp, and M. Ulbrich, A semismooth Newton stochastic proximal point algorithm with variance reduction, SIAM Journal on Optimization, 34(1) (2024), 1157-1185

work page 2024

[31] [31]

S. Na, M. Derezinski, and M.W. Mahoney, Hessian averaging in stochastic Newton methods achieves superlinear convergence, Mathematical Programming, 201(1) (2023), 473-520

work page 2023

[32] [32]

Neri, A finitary Kronecker’s lemma and large deviations in the strong law of large numbers on Banach spaces, Annals of Pure and Applied Logic, 176(6) (2025), 103569

M. Neri, A finitary Kronecker’s lemma and large deviations in the strong law of large numbers on Banach spaces, Annals of Pure and Applied Logic, 176(6) (2025), 103569

work page 2025

[33] [33]

Nesterov, Lectures on convex optimization, Vol

Y. Nesterov, Lectures on convex optimization, Vol. 137, Berlin: Springer International Publish- ing, 2018

work page 2018

[34] [34]

Nguyen, J

L.M. Nguyen, J. Liu, K. Scheinberg, and M. Takac, SARAH: A novel method for machine learn- ing problems using stochastic recursive gradient, International Conference on Machine Learning, 2017, 2613-2621

work page 2017

[35] [35]

Pham, L.M

N.H. Pham, L.M. Nguyen, D.T. Phan, and Q. Tran-Dinh, ProxSARAH: An efficient algorith- mic framework for stochastic composite nonconvex optimization, Journal of Machine Learning Research, 21(110) (2020), 1-48

work page 2020

[36] [36]

D.N. Phan, S. Bartz, N. Guha, and H.M. Phan, Stochastic Variance-Reduced Majorization- Minimization Algorithms, SIAM Journal on Mathematics of Data Science, 6(4) (2024), 926-952

work page 2024

[37] [37]

Phan, and N

D.N. Phan, and N. Gillis, An inertial block majorization minimization framework for nonsmooth nonconvex optimization, Journal of Machine Learning Research, 24(18) (2023), 1-41. 20

work page 2023

[38] [38]

Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained mini- mization problem, SIAM Journal on Optimization, 7(1) (1997), 26-33

M. Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained mini- mization problem, SIAM Journal on Optimization, 7(1) (1997), 26-33

work page 1997

[39] [39]

Robbins, and S

H. Robbins, and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 1951, 400-407

work page 1951

[40] [40]

Robbins, and D

H. Robbins, and D. Siegmund, A convergence theorem for non negative almost supermartingales and some applications, Optimizing Methods in Statistics, 1971, 233-257

work page 1971

[41] [41]

Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14(5) (1976), 877-898

R.T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14(5) (1976), 877-898

work page 1976

[42] [42]

C. Tan, S. Ma, Y.H. Dai, and Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, Advances in Neural Information Processing Systems 29, 2016, 685-693

work page 2016

[43] [43]

Tibshirani, Regression shrinkage and selection via the lasso, Journal of The Royal Statistical Society Series B: Statistical Methodology, 58(1) (1996), 267-288

R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of The Royal Statistical Society Series B: Statistical Methodology, 58(1) (1996), 267-288

work page 1996

[44] [44]

Tran-Dinh, N.H

Q. Tran-Dinh, N.H. Pham, D.T. Phan, and L.M. Nguyen, A hybrid stochastic optimization framework for composite nonconvex optimization, Mathematical Programming, 191(2) (2022), 1005-1071

work page 2022

[45] [45]

Traore, V

C. Traore, V. Apidopoulos, S. Salzo, and S. Villa, Variance reduction techniques for stochastic proximal point algorithms, Journal of Optimization Theory and Applications, 203(2) (2024), 1910-1939

work page 2024

[46] [46]

Xiao, Dual averaging method for regularized stochastic learning and online optimization, The Journal of Machine Learning Research, 11 (2010), 2543-2596

L. Xiao, Dual averaging method for regularized stochastic learning and online optimization, The Journal of Machine Learning Research, 11 (2010), 2543-2596

work page 2010

[47] [47]

Xiao, and T

L. Xiao, and T. Zhang, A proximal stochastic gradient method with progressive variance reduc- tion, SIAM Journal on Optimization, 24(4) (2014), 2057-2075

work page 2014

[48] [48]

Xu, and Y

Y. Xu, and Y. Xu, Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization, Journal of Optimization Theory and Applications, 196(1) (2023), 266-297

work page 2023

[49] [49]

Yang, and H

Y. Yang, and H. Zou, A fast unified algorithm for solving group-lasso penalize learning problems, Statistics and Computing, 25(6) (2015), 1129-1141

work page 2015

[50] [50]

D. Zhou, S. Ma, and J. Yang, AdaBB: Adaptive Barzilai-Borwein Method for Convex Optimiza- tion, arXiv preprint arXiv:2401.08024 (2024)

work page arXiv 2024

[51] [51]

Zou, and T

H. Zou, and T. Hastie, Regularization and variable selection via the elastic net, Journal of The Royal Statistical Society Series B: Statistical Methodology, 67(2) (2005), 301-320. 21

work page 2005