A Stochastic Composite Gradient Method with Incremental Variance Reduction
Pith reviewed 2026-05-25 17:00 UTC · model grok-4.3
The pith
A stochastic composite gradient method with incremental variance reduction matches the best known complexity for nonconvex finite-sum and expectation problems despite biased estimators from composition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased.
What carries the argument
Incremental variance-reduced estimator applied simultaneously to the inner vector mapping and its Jacobian, which controls bias introduced by the outer composition.
If this is right
- The same complexity orders apply to both expectation-based and finite-sum formulations of the composite problem.
- A wider class of machine learning models involving composition can now use incremental variance reduction without losing optimal rates.
- The method preserves the low per-iteration cost typical of incremental variance reduction while handling the extra Jacobian estimation.
Where Pith is reading between the lines
- The estimator construction may extend to other optimization settings where gradient estimates are biased by an outer nonlinear mapping.
- Practical implementations could be tested on composite losses arising in structured prediction or generative models to measure wall-clock gains.
- The analysis framework might adapt to related problems such as bilevel optimization or hyperparameter tuning that also feature composed expectations.
Load-bearing premise
The inner mapping and outer function are both smooth and the variance reduction estimator satisfies standard incremental update properties used in SVRG and SAGA analyses.
What would settle it
A concrete nonconvex composite objective where the proposed method requires more than the stated number of iterations or gradient evaluations to reach a given accuracy level.
Figures
read the original abstract
We consider the problem of minimizing the composition of a smooth (nonconvex) function and a smooth vector mapping, where the inner mapping is in the form of an expectation over some random variable or a finite sum. We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased. This finding enables a much broader range of applications in machine learning to benefit from the low complexity of incremental variance-reduction methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a stochastic composite gradient method with incremental variance-reduced estimators applied to both the inner vector mapping and its Jacobian. The central claim is that the resulting algorithm attains the same iteration complexity orders as the best-known first-order variance-reduced methods for nonconvex expected-value and finite-sum problems, even though the outer composition renders the gradient estimator biased.
Significance. If the complexity result is correct, the work is significant because it shows that the bias induced by the composition can be controlled under standard smoothness assumptions without inflating the leading terms, thereby extending the reach of low-complexity incremental VR methods to a wider class of composite problems that arise in machine learning.
minor comments (3)
- The abstract would be strengthened by explicitly stating the concrete complexity orders achieved (e.g., O(1/ε²) or O(n + √n/ε) depending on the setting) rather than only claiming they match prior work.
- §2: the notation distinguishing the two incremental estimators (one for the mapping, one for the Jacobian) should be introduced earlier and with an explicit comparison to the standard SVRG/SAGA recursions to improve readability.
- The experimental section would benefit from additional baselines that isolate the effect of the Jacobian variance reduction versus using only a mapping estimator.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments under the MAJOR COMMENTS section.
Circularity Check
No significant circularity detected
full rationale
The paper derives complexity bounds for a stochastic composite gradient method using incremental variance-reduced estimators for both the inner mapping and its Jacobian. The central claim—that the method matches the best-known orders for nonconvex expected-value and finite-sum problems despite estimator bias—is established by controlling the bias term under standard smoothness assumptions on the inner and outer functions together with the incremental update properties of variance reduction (as in SVRG/SAGA). These assumptions are invoked directly in the analysis rather than being fitted to data or defined in terms of the target result; no self-citation chain, ansatz smuggling, or renaming of known empirical patterns is required for the leading terms. The derivation therefore remains self-contained against external benchmarks from the variance-reduction literature.
Axiom & Free-Parameter Ledger
free parameters (1)
- step size
axioms (2)
- domain assumption Outer function and inner mapping are smooth (Lipschitz gradients).
- domain assumption Incremental variance-reduced estimator satisfies the standard unbiasedness and variance reduction properties.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian... achieves the same orders of complexity as the best known first-order methods... despite the additional outer composition which renders the composite gradient estimator biased.
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Assumption 1... f and g_ξ Lipschitz and smooth... LF = ℓ_g² Lf + ℓf Lg
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter
Zeyuan Allen-Zhu. Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machin e Learning (ICML), volume 70 of Proceedings of Machine Learning Research , pages 89–97, Sydney, Australia, 2017
work page 2017
-
[2]
Natasha 2: Faster non-convex optimiz ation than SGD
Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimiz ation than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018
work page 2018
-
[3]
Variance reduction for faster non-convex optimization
Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machin e Learning (ICML) , pages 699–707, 2016
work page 2016
-
[4]
First-Order Methods in Optimization
Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017
work page 2017
-
[5]
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Y a nn LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchin a. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[6]
Policy evaluation with temporal differences: a survey and comparison
Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research , 15(1):809–883, 2014
work page 2014
-
[7]
SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014
work page 2014
-
[8]
Spider: Near-optimal non- convex optimization via stochastic path-integrated differ ential estimator
Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non- convex optimization via stochastic path-integrated differ ential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018
work page 2018
-
[9]
Stochastic first- and zer oth-order methods for nonconvex stochastic programming
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zer oth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341–2368, 2013
work page 2013
-
[10]
Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Y oshua Bengio. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[11]
O n graduated optimization for stochastic non-convex problems
Elad Hazan, Kfir Y ehuda Levy, and Shai Shalev-Shwartz. O n graduated optimization for stochastic non-convex problems. In International conference on machine learning , pages 1833–1841, 2016
work page 2016
-
[12]
Accelerate d method for stochastic composition optimization with nonsmooth regularization
Zhouyuan Huo, Bin Gu, Ji Jiu, and Heng Huang. Accelerate d method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3287–3294, 2018
work page 2018
-
[13]
Accelerating stochastic gradient descent using predictive variance reduction
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26 , pages 315–323, 2013
work page 2013
-
[14]
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear co nvergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz c ondition. In Machine Learning and Knowledge Discovery in Database - European Conference, Proceedings, pages 795–811, 2016
work page 2016
-
[15]
N on-convex finite-sum optimization via SCSG methods
Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. N on-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30 , pages 2348–
-
[16]
Curran Associates, Inc., 2017
work page 2017
-
[17]
A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization
Zhize Li and Jian Li. A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 5564–5574. Curran Associates, Inc., 2018
work page 2018
-
[18]
Finite-sum compo sition optimization via variance reduced gradient descent
Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum compo sition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artific ial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017. 9
work page 2017
-
[19]
On the link between gau ssian homotopy continuation and convex envelopes
Hossein Mobahi and John W Fisher. On the link between gau ssian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in C omputer Vision and Pattern Recognition, pages 43–56. Springer, 2015
work page 2015
-
[20]
Gradient methods for minimizing compo site functions
Yurii Nesterov. Gradient methods for minimizing compo site functions. Mathematical Pro- gramming, 140(1):125–161, 2013
work page 2013
-
[21]
Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč
Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Y ee Whye Teh, editors, Proceedings of the 34th International Conference on Machin e Learning (ICML) , volume 70 of Proceedings of Machine Learning Research (PMLR) , pages 2613–2621, Sydn...
work page 2017
-
[22]
Finite-Sum Smooth Optimization with SARAH
Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha N guyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization w ith sarah. arXiv preprint, arXiv:1901.07648, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[23]
ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization
Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran- Dinh. ProxSARAH: An effi- cient algorithmic framework for stochastic composite nonc onvex optimization. arXiv preprint, arXiv:1902.05679, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[24]
Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola
Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning , volume 48 of Proceedings of Machine Learning Research , pages 314–323, New Y ork, New Y ork, USA, 2016
work page 2016
-
[25]
Fast incremental method for smooth nonconvex optimization
Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex S mola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016
work page 2016
-
[26]
Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization
Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexa nder J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. I n Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016
work page 2016
-
[27]
R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970
work page 1970
-
[28]
R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. IN- FORMS TutORials in Operations Research, 2007
work page 2007
-
[29]
Advances in risk-averse optimiz ation
Andrzej Ruszczyński. Advances in risk-averse optimiz ation. INFORMS TutORials in Operation Research, 2013
work page 2013
-
[30]
Reinforcement Learning: An Introduction
Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, 1998
work page 1998
-
[31]
Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compo sitional gradient descent: algo- rithms for minimizing compositions of expected-value func tions. Mathematical Programming, 161(1-2):419–449, 2017
work page 2017
-
[32]
Accelerating stoch astic composition optimization
Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stoch astic composition optimization. Journal of Machine Learning Research , 18(105):1–23, 2017
work page 2017
-
[33]
SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization
Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tar okh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018
-
[34]
A proximal stochastic gradient method with progressive variance reduction
Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization , 24(4):2057–2075, 2014
work page 2057
-
[35]
A composite randomized increm ental gradient method
Junyu Zhang and Lin Xiao. A composite randomized increm ental gradient method. In Pro- ceedings of the 36th International Conference on Machine Le arning (ICML) , number 97 in Proceedings of Machine Learning Research (PMLR), Long Beac h, California, 2019
work page 2019
-
[36]
Stochastic neste d variance reduced gradient descent for nonconvex optimization
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic neste d variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 3921–3932. Curran Associates, Inc., 2018. 10 Appendices A Convergence analysis for composite expectation case In this section, we focus on convergence analysis of CIVR for s...
work page 2018
-
[37]
− Φ∗ + 3σ2 0 η 4B τT, where we have applied the fact that xt 0 = xt− 1 τ . By the random sampling scheme for output ¯ x, we have E[∥G( ¯x)∥ 2] = 1 τT T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1
-
[38]
(45) Substitute the values of T, τand B gives (22)
− Φ∗) τT η + 6σ2 0 B . (45) Substitute the values of T, τand B gives (22). To simplify presentation, we omit ⌈·⌉ on integer parameters in the following discussion. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ √ ǫ, B = σ2 0 / ǫ, and τ= S = 1/ √ ǫ, we have E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1
-
[39]
− Φ∗) η− 1 + 6) ǫ, and the sample complexity is T ( B + 2τS) = O (σ2 0 ǫ− 3/ 2 + ǫ− 3/ 2) , as in our theorem. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ ǫ, B = 1 + σ2 0 / ǫ, and τ= S = 1, we again obtain E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1
-
[40]
For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity
− Φ∗) η− 1 + 6) ǫ, but the sample complexity isT ( B+2τS) = O (σ2 0 ǫ− 2 +ǫ− 1) , which is same as in Ghadimi and Lan [9]. For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity. 14 A.2 Proof of Theorem 2 Proof. Note that for this set of parameters, we still have the relati onship that τt = St . Therefore, within each epoch, (44...
-
[41]
− Φ∗ + T∑ t=1 3σ2 0 η 4Bt τt . (46) By the random selection rule of ¯ x, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1
-
[42]
(47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉
− Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T t=1 τt / Bt ∑T t=1 τt . (47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉. We have T∑ t=1 τt ≥ T∑ t=1 at + b = a 2 T ( T + 1) + bT = O( T 2) and σ2 0 T∑ t=1 τt / Bt ≤ T∑ t=1 1 at + b ≤ 1 a + b + /uni222B.dsp T 1 dt at + b = 1 a + b + 1 a ln ( aT + b a + b ) = O( ln T ) . Substituting the above bounds into inequality (4...
-
[43]
This is more close to the classical results on stochastic opt imization
− Φ∗) ηT + 6 lnT T , but the sample complexity, by setting T = ˜O( ǫ− 1) so that the above bound is less than ǫ, is T∑ t=1 (Bt + 2τt St ) ≤ T∑ t=1 ( σ2 0 ( at + b) + 2 ) = O( σ2 0 T 2 + T ) = ˜O( σ2 0 ǫ− 2 + ǫ− 1) . This is more close to the classical results on stochastic opt imization. B Convergence analysis for composite finite-sum case In this section,...
-
[44]
Namely, in (11) of Algorithm 1, we choose Bt = { 1,
by their exact value rather than the approximate ones constructed by subsa mpling. Namely, in (11) of Algorithm 1, we choose Bt = { 1, . . . , n} for all t ≥ 1. Therefore, yt 0 = g( xt
-
[45]
= 1 n n∑ j=1 gj ( xt
-
[46]
= 1 n n∑ j=1 g′ j ( xt 0) and E [ ∥yt 0 − g( xt 0)∥ 2] = 0 , E [ ∥zt 0 − g′( xk 0 )∥ 2] = 0 . (48) 15 As a result, the initial variances in Lemma 1 diminishes and ( 33) reduces to E[∥yt i − g( xt i )∥ 2] ≤ i∑ r=1 ℓ2 g St E[∥xt r − xt r− 1 ∥2] , E[∥zt i − g′( xt i )∥ 2] ≤ i∑ r=1 L2 g St E[∥xt r − xt r− 1∥2] . (49) In addition, com...
-
[47]
− Φ∗) τT η = 8( Φ( x1
-
[48]
Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution
− Φ∗) √ nTη . Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution. Note that the sample complexity per epoch is n + τt St = 2n, the total sample complexity will be O( n + √ nǫ− 1) . B.2 Proof of Theorem 4 Proof. If T ≤ T0, then the result is exactly what we proved from Theorem 2. The refore, the first bound in (26) is already guaranteed. If T > T...
-
[49]
− E[Φ( xT0+1 0 )] + T0∑ t=1 3σ2 0 η 4Bt τt . (51) When T0 + 1 ≤ t ≤ T , the following bound becomes effective, η 8 T∑ t=T0+1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ E[Φ( xT0+1 0 )] − Φ∗. Therefore, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1
-
[50]
− Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T0 t=1 τt / Bt ∑T t=1 τt . Note that T0∑ t=1 τt / Bt ≤ T0∑ t=1 1 at + b ≤ 1 a + b + 1 a ln ( aT0 + b a + b ) = O( ln n) , and T∑ t=1 τt ≥ ( T − T0) √ n + T0∑ t=1 ( at + b) = √ n( T − T0) + a 2 T 2 0 + ( a 2 + b) T0 = O( √ n( T − T0 + 1)) . 16 With the above two bounds, we have proved the second result in (26). For any ǫ >0, if ...
-
[51]
− F ∗) τT η + 6νσ2 0 B By the selection of T = ⌈ 16ν√ ǫ η ⌉ , τ= 1/ √ ǫand B = 1 + 12νσ2 0 / ǫ, we have E[F( ¯x) − F ∗] ≤ 1 2 ( F( x1
-
[52]
− F ∗) + 1 2 ǫ, (52) which is (28). Suppose we periodically restart the Algorithm 1 after every T epochs, and set the outputs to be ¯ xk , where k = 1, 2, ... denotes the number of restarts. We use the output of the kth period ¯xk as the initial point to start the next period, which produces ¯ xk+1. As a result, the above inequality translates to E[F( ¯xk...
-
[53]
C.2 Proof of Theorem 6 The proof is very similar to the previous one
− F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 16ν η ( 12νσ2 0 ǫ + 2 ) ln 1 ǫ= O ( ν2σ2 0 ǫ− 1/ 2 + ν ) ln ǫ− 1 Defining the condition number κ= LF ν= O( ν/ η) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. C.2 Proof of Theorem 6...
-
[54]
By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n
− F ∗) τT η . By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n. we again obtain (52). In this case, we have B = n and T ( B + 2τS) ǫ− 1 = ⌈ 16ν η√ n ⌉ ( n + 2√ n√ n ) ln ǫ− 1 = O ( n + ν√ n ) ln ǫ− 1. D Convergence analysis under optimally strong convexity In order to prove Theorems 7 and 8, we first state Lemma 3 in [33] in our notations. Lemma 5 (Lemma 3 in [33...
-
[55]
+ 2µ∥xt 0 − x∗∥2] −( 1 2η− LF 2 − τt 9G0η 2St ) τt∑ r=1 E[∥xt r − xt r− 1∥2] + τt 9σ2 0 η 2Bt . According to the selection of τt, St, Bt and η, we know that the coefficient ( 1 2η − LF 2 − τt 9G0η 2St ) ≥ 0. Consequently, 2µη τt − 1∑ i=0 E[Φ( xt i+1) − Φ∗] ≤ E[Φ( xt τt ) + 2µ∥xt τt − x∗∥2] − E[Φ( xt
-
[56]
+ 2µ∥xt 0 − x∗∥2] + τt 9σ2 0 η 2Bt . Summing this up and apply the random selection rule of ¯ x gives E[Φ( ¯x) − Φ∗] ≤ 1 2µητT E[Φ( x1 0 ) − Φ∗ + 2µ∥x1 0 − x∗∥2] + 9σ2 0 4µBt ≤ 5 2µητT E[Φ( x1 0 ) − Φ∗] + 9σ2 0 4µBt . If we choose T = ⌈5√ ǫ µη ⌉, τ= S = 1√ ǫ and Bt = 1 + 9σ 2 0 2µǫ , then 5 2µητ T ≤ 1 2 and we obtain E[Φ( ¯x) − Φ∗] ≤ 1 2 E[Φ( x1
-
[57]
This proves the inequality (31)
− Φ∗] + 1 2 ǫ . This proves the inequality (31). The rest of the proof will mi mic that of Theorem 5. Discussions on sample complexity: • If we choose τ= S = 1/ √ ǫ, Bt = 1 + 9σ 2 0 2µǫ , and T = ⌈5√ ǫ µη ⌉, then the sample complexity is T ( B + 2τS) ln 1 ǫ= 5√ ǫ µη ( 9σ2 0 2µǫ+ 1√ ǫ 1√ ǫ ) ln 1 ǫ= O ( ( µ− 2σ2 0 ǫ− 1/ 2 + µ− 1ǫ− 1/ 2) ln ǫ− 1 ) . The abo...
-
[58]
D.2 Proof of Theorem 8 The proof is very similar to the previous one
− F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 5 µη ( 9σ2 0 µǫ+ 2 ) ln 1 ǫ= O ( µ− 2σ2 0 ǫ− 1 + µ− 1 ) ln ǫ− 1 Defining the condition number κ= LF ν= O( 1/( µη)) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. D.2 Proof of Theorem...
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.