pith. sign in

arxiv: 1906.10186 · v1 · pith:EWB5ZLMVnew · submitted 2019-06-24 · 🧮 math.OC · stat.ML

A Stochastic Composite Gradient Method with Incremental Variance Reduction

Pith reviewed 2026-05-25 17:00 UTC · model grok-4.3

classification 🧮 math.OC stat.ML
keywords stochastic optimizationvariance reductioncomposite functionsnonconvex optimizationfinite-sum problemsgradient methodsexpectation minimizationmachine learning
0
0 comments X

The pith

A stochastic composite gradient method with incremental variance reduction matches the best known complexity for nonconvex finite-sum and expectation problems despite biased estimators from composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to minimize the composition of a smooth nonconvex outer function with a smooth inner vector mapping expressed as an expectation or finite sum. It introduces an incremental variance-reduced estimator that tracks both the inner mapping and its Jacobian. The central result is that this approach attains the same complexity orders as leading first-order methods for ordinary nonconvex stochastic problems. The result holds even though the outer composition introduces bias into the gradient estimator. This extends the reach of low-complexity variance reduction techniques to a larger set of machine learning objectives.

Core claim

We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased.

What carries the argument

Incremental variance-reduced estimator applied simultaneously to the inner vector mapping and its Jacobian, which controls bias introduced by the outer composition.

If this is right

  • The same complexity orders apply to both expectation-based and finite-sum formulations of the composite problem.
  • A wider class of machine learning models involving composition can now use incremental variance reduction without losing optimal rates.
  • The method preserves the low per-iteration cost typical of incremental variance reduction while handling the extra Jacobian estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The estimator construction may extend to other optimization settings where gradient estimates are biased by an outer nonlinear mapping.
  • Practical implementations could be tested on composite losses arising in structured prediction or generative models to measure wall-clock gains.
  • The analysis framework might adapt to related problems such as bilevel optimization or hyperparameter tuning that also feature composed expectations.

Load-bearing premise

The inner mapping and outer function are both smooth and the variance reduction estimator satisfies standard incremental update properties used in SVRG and SAGA analyses.

What would settle it

A concrete nonconvex composite objective where the proposed method requires more than the stated number of iterations or gradient evaluations to reach a given accuracy level.

Figures

Figures reproduced from arXiv: 1906.10186 by Junyu Zhang, Lin Xiao.

Figure 1
Figure 1. Figure 1: Experiments on the risk-averse portfolio optimiz [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Experiments on policy evaluation for MDP for cases [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗
read the original abstract

We consider the problem of minimizing the composition of a smooth (nonconvex) function and a smooth vector mapping, where the inner mapping is in the form of an expectation over some random variable or a finite sum. We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased. This finding enables a much broader range of applications in machine learning to benefit from the low complexity of incremental variance-reduction methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a stochastic composite gradient method with incremental variance-reduced estimators applied to both the inner vector mapping and its Jacobian. The central claim is that the resulting algorithm attains the same iteration complexity orders as the best-known first-order variance-reduced methods for nonconvex expected-value and finite-sum problems, even though the outer composition renders the gradient estimator biased.

Significance. If the complexity result is correct, the work is significant because it shows that the bias induced by the composition can be controlled under standard smoothness assumptions without inflating the leading terms, thereby extending the reach of low-complexity incremental VR methods to a wider class of composite problems that arise in machine learning.

minor comments (3)
  1. The abstract would be strengthened by explicitly stating the concrete complexity orders achieved (e.g., O(1/ε²) or O(n + √n/ε) depending on the setting) rather than only claiming they match prior work.
  2. §2: the notation distinguishing the two incremental estimators (one for the mapping, one for the Jacobian) should be introduced earlier and with an explicit comparison to the standard SVRG/SAGA recursions to improve readability.
  3. The experimental section would benefit from additional baselines that isolate the effect of the Jacobian variance reduction versus using only a mapping estimator.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments under the MAJOR COMMENTS section.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives complexity bounds for a stochastic composite gradient method using incremental variance-reduced estimators for both the inner mapping and its Jacobian. The central claim—that the method matches the best-known orders for nonconvex expected-value and finite-sum problems despite estimator bias—is established by controlling the bias term under standard smoothness assumptions on the inner and outer functions together with the incremental update properties of variance reduction (as in SVRG/SAGA). These assumptions are invoked directly in the analysis rather than being fitted to data or defined in terms of the target result; no self-citation chain, ansatz smuggling, or renaming of known empirical patterns is required for the leading terms. The derivation therefore remains self-contained against external benchmarks from the variance-reduction literature.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The paper relies on standard smoothness assumptions for the outer function and inner mapping, plus properties of incremental variance reduction estimators. No new entities are invented. Step-size parameters are chosen based on problem constants but are not data-fitted in the usual sense.

free parameters (1)
  • step size
    Chosen based on smoothness constants and variance bounds to achieve the stated complexity; typical in convergence analyses.
axioms (2)
  • domain assumption Outer function and inner mapping are smooth (Lipschitz gradients).
    Standard assumption for first-order methods; invoked to bound the bias and variance in the composite estimator.
  • domain assumption Incremental variance-reduced estimator satisfies the standard unbiasedness and variance reduction properties.
    Relies on properties from prior SVRG/SAGA literature.

pith-pipeline@v0.9.0 · 5628 in / 1290 out tokens · 19419 ms · 2026-05-25T17:00:15.376090+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 4 internal anchors

  1. [1]

    Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter

    Zeyuan Allen-Zhu. Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machin e Learning (ICML), volume 70 of Proceedings of Machine Learning Research , pages 89–97, Sydney, Australia, 2017

  2. [2]

    Natasha 2: Faster non-convex optimiz ation than SGD

    Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimiz ation than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018

  3. [3]

    Variance reduction for faster non-convex optimization

    Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machin e Learning (ICML) , pages 699–707, 2016

  4. [4]

    First-Order Methods in Optimization

    Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017

  5. [5]

    Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

    Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Y a nn LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchin a. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016

  6. [6]

    Policy evaluation with temporal differences: a survey and comparison

    Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research , 15(1):809–883, 2014

  7. [7]

    SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives

    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014

  8. [8]

    Spider: Near-optimal non- convex optimization via stochastic path-integrated differ ential estimator

    Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non- convex optimization via stochastic path-integrated differ ential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018

  9. [9]

    Stochastic first- and zer oth-order methods for nonconvex stochastic programming

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zer oth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341–2368, 2013

  10. [10]

    Mollifying Networks

    Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Y oshua Bengio. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016

  11. [11]

    O n graduated optimization for stochastic non-convex problems

    Elad Hazan, Kfir Y ehuda Levy, and Shai Shalev-Shwartz. O n graduated optimization for stochastic non-convex problems. In International conference on machine learning , pages 1833–1841, 2016

  12. [12]

    Accelerate d method for stochastic composition optimization with nonsmooth regularization

    Zhouyuan Huo, Bin Gu, Ji Jiu, and Heng Huang. Accelerate d method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3287–3294, 2018

  13. [13]

    Accelerating stochastic gradient descent using predictive variance reduction

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26 , pages 315–323, 2013

  14. [14]

    Linear co nvergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz c ondition

    Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear co nvergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz c ondition. In Machine Learning and Knowledge Discovery in Database - European Conference, Proceedings, pages 795–811, 2016

  15. [15]

    N on-convex finite-sum optimization via SCSG methods

    Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. N on-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30 , pages 2348–

  16. [16]

    Curran Associates, Inc., 2017

  17. [17]

    A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization

    Zhize Li and Jian Li. A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 5564–5574. Curran Associates, Inc., 2018

  18. [18]

    Finite-sum compo sition optimization via variance reduced gradient descent

    Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum compo sition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artific ial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017. 9

  19. [19]

    On the link between gau ssian homotopy continuation and convex envelopes

    Hossein Mobahi and John W Fisher. On the link between gau ssian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in C omputer Vision and Pattern Recognition, pages 43–56. Springer, 2015

  20. [20]

    Gradient methods for minimizing compo site functions

    Yurii Nesterov. Gradient methods for minimizing compo site functions. Mathematical Pro- gramming, 140(1):125–161, 2013

  21. [21]

    Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč

    Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Y ee Whye Teh, editors, Proceedings of the 34th International Conference on Machin e Learning (ICML) , volume 70 of Proceedings of Machine Learning Research (PMLR) , pages 2613–2621, Sydn...

  22. [22]

    Finite-Sum Smooth Optimization with SARAH

    Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha N guyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization w ith sarah. arXiv preprint, arXiv:1901.07648, 2019

  23. [23]

    ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization

    Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran- Dinh. ProxSARAH: An effi- cient algorithmic framework for stochastic composite nonc onvex optimization. arXiv preprint, arXiv:1902.05679, 2019

  24. [24]

    Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola

    Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning , volume 48 of Proceedings of Machine Learning Research , pages 314–323, New Y ork, New Y ork, USA, 2016

  25. [25]

    Fast incremental method for smooth nonconvex optimization

    Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex S mola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016

  26. [26]

    Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization

    Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexa nder J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. I n Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016

  27. [27]

    Tyrrell Rockafellar

    R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970

  28. [28]

    Tyrrell Rockafellar

    R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. IN- FORMS TutORials in Operations Research, 2007

  29. [29]

    Advances in risk-averse optimiz ation

    Andrzej Ruszczyński. Advances in risk-averse optimiz ation. INFORMS TutORials in Operation Research, 2013

  30. [30]

    Reinforcement Learning: An Introduction

    Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, 1998

  31. [31]

    Stochastic compo sitional gradient descent: algo- rithms for minimizing compositions of expected-value func tions

    Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compo sitional gradient descent: algo- rithms for minimizing compositions of expected-value func tions. Mathematical Programming, 161(1-2):419–449, 2017

  32. [32]

    Accelerating stoch astic composition optimization

    Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stoch astic composition optimization. Journal of Machine Learning Research , 18(105):1–23, 2017

  33. [33]

    SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization

    Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tar okh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018

  34. [34]

    A proximal stochastic gradient method with progressive variance reduction

    Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization , 24(4):2057–2075, 2014

  35. [35]

    A composite randomized increm ental gradient method

    Junyu Zhang and Lin Xiao. A composite randomized increm ental gradient method. In Pro- ceedings of the 36th International Conference on Machine Le arning (ICML) , number 97 in Proceedings of Machine Learning Research (PMLR), Long Beac h, California, 2019

  36. [36]

    Stochastic neste d variance reduced gradient descent for nonconvex optimization

    Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic neste d variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 3921–3932. Curran Associates, Inc., 2018. 10 Appendices A Convergence analysis for composite expectation case In this section, we focus on convergence analysis of CIVR for s...

  37. [37]

    By the random sampling scheme for output ¯ x, we have E[∥G( ¯x)∥ 2] = 1 τT T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

    − Φ∗ + 3σ2 0 η 4B τT, where we have applied the fact that xt 0 = xt− 1 τ . By the random sampling scheme for output ¯ x, we have E[∥G( ¯x)∥ 2] = 1 τT T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

  38. [38]

    (45) Substitute the values of T, τand B gives (22)

    − Φ∗) τT η + 6σ2 0 B . (45) Substitute the values of T, τand B gives (22). To simplify presentation, we omit ⌈·⌉ on integer parameters in the following discussion. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ √ ǫ, B = σ2 0 / ǫ, and τ= S = 1/ √ ǫ, we have E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

  39. [39]

    • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ ǫ, B = 1 + σ2 0 / ǫ, and τ= S = 1, we again obtain E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

    − Φ∗) η− 1 + 6) ǫ, and the sample complexity is T ( B + 2τS) = O (σ2 0 ǫ− 3/ 2 + ǫ− 3/ 2) , as in our theorem. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ ǫ, B = 1 + σ2 0 / ǫ, and τ= S = 1, we again obtain E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

  40. [40]

    For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity

    − Φ∗) η− 1 + 6) ǫ, but the sample complexity isT ( B+2τS) = O (σ2 0 ǫ− 2 +ǫ− 1) , which is same as in Ghadimi and Lan [9]. For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity. 14 A.2 Proof of Theorem 2 Proof. Note that for this set of parameters, we still have the relati onship that τt = St . Therefore, within each epoch, (44...

  41. [41]

    (46) By the random selection rule of ¯ x, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

    − Φ∗ + T∑ t=1 3σ2 0 η 4Bt τt . (46) By the random selection rule of ¯ x, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

  42. [42]

    (47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉

    − Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T t=1 τt / Bt ∑T t=1 τt . (47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉. We have T∑ t=1 τt ≥ T∑ t=1 at + b = a 2 T ( T + 1) + bT = O( T 2) and σ2 0 T∑ t=1 τt / Bt ≤ T∑ t=1 1 at + b ≤ 1 a + b + /uni222B.dsp T 1 dt at + b = 1 a + b + 1 a ln ( aT + b a + b ) = O( ln T ) . Substituting the above bounds into inequality (4...

  43. [43]

    This is more close to the classical results on stochastic opt imization

    − Φ∗) ηT + 6 lnT T , but the sample complexity, by setting T = ˜O( ǫ− 1) so that the above bound is less than ǫ, is T∑ t=1 (Bt + 2τt St ) ≤ T∑ t=1 ( σ2 0 ( at + b) + 2 ) = O( σ2 0 T 2 + T ) = ˜O( σ2 0 ǫ− 2 + ǫ− 1) . This is more close to the classical results on stochastic opt imization. B Convergence analysis for composite finite-sum case In this section,...

  44. [44]

    Namely, in (11) of Algorithm 1, we choose Bt = { 1,

    by their exact value rather than the approximate ones constructed by subsa mpling. Namely, in (11) of Algorithm 1, we choose Bt = { 1, . . . , n} for all t ≥ 1. Therefore, yt 0 = g( xt

  45. [45]

    = 1 n n∑ j=1 gj ( xt

  46. [46]

    = 1 n n∑ j=1 g′ j ( xt 0) and E [ ∥yt 0 − g( xt 0)∥ 2] = 0 , E [ ∥zt 0 − g′( xk 0 )∥ 2] = 0 . (48) 15 As a result, the initial variances in Lemma 1 diminishes and ( 33) reduces to              E[∥yt i − g( xt i )∥ 2] ≤ i∑ r=1 ℓ2 g St E[∥xt r − xt r− 1 ∥2] , E[∥zt i − g′( xt i )∥ 2] ≤ i∑ r=1 L2 g St E[∥xt r − xt r− 1∥2] . (49) In addition, com...

  47. [47]

    − Φ∗) τT η = 8( Φ( x1

  48. [48]

    Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution

    − Φ∗) √ nTη . Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution. Note that the sample complexity per epoch is n + τt St = 2n, the total sample complexity will be O( n + √ nǫ− 1) . B.2 Proof of Theorem 4 Proof. If T ≤ T0, then the result is exactly what we proved from Theorem 2. The refore, the first bound in (26) is already guaranteed. If T > T...

  49. [49]

    (51) When T0 + 1 ≤ t ≤ T , the following bound becomes effective, η 8 T∑ t=T0+1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ E[Φ( xT0+1 0 )] − Φ∗

    − E[Φ( xT0+1 0 )] + T0∑ t=1 3σ2 0 η 4Bt τt . (51) When T0 + 1 ≤ t ≤ T , the following bound becomes effective, η 8 T∑ t=T0+1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ E[Φ( xT0+1 0 )] − Φ∗. Therefore, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

  50. [50]

    − Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T0 t=1 τt / Bt ∑T t=1 τt . Note that T0∑ t=1 τt / Bt ≤ T0∑ t=1 1 at + b ≤ 1 a + b + 1 a ln ( aT0 + b a + b ) = O( ln n) , and T∑ t=1 τt ≥ ( T − T0) √ n + T0∑ t=1 ( at + b) = √ n( T − T0) + a 2 T 2 0 + ( a 2 + b) T0 = O( √ n( T − T0 + 1)) . 16 With the above two bounds, we have proved the second result in (26). For any ǫ >0, if ...

  51. [51]

    − F ∗) τT η + 6νσ2 0 B By the selection of T = ⌈ 16ν√ ǫ η ⌉ , τ= 1/ √ ǫand B = 1 + 12νσ2 0 / ǫ, we have E[F( ¯x) − F ∗] ≤ 1 2 ( F( x1

  52. [52]

    Suppose we periodically restart the Algorithm 1 after every T epochs, and set the outputs to be ¯ xk , where k = 1, 2,

    − F ∗) + 1 2 ǫ, (52) which is (28). Suppose we periodically restart the Algorithm 1 after every T epochs, and set the outputs to be ¯ xk , where k = 1, 2, ... denotes the number of restarts. We use the output of the kth period ¯xk as the initial point to start the next period, which produces ¯ xk+1. As a result, the above inequality translates to E[F( ¯xk...

  53. [53]

    C.2 Proof of Theorem 6 The proof is very similar to the previous one

    − F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 16ν η ( 12νσ2 0 ǫ + 2 ) ln 1 ǫ= O ( ν2σ2 0 ǫ− 1/ 2 + ν ) ln ǫ− 1 Defining the condition number κ= LF ν= O( ν/ η) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. C.2 Proof of Theorem 6...

  54. [54]

    By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n

    − F ∗) τT η . By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n. we again obtain (52). In this case, we have B = n and T ( B + 2τS) ǫ− 1 = ⌈ 16ν η√ n ⌉ ( n + 2√ n√ n ) ln ǫ− 1 = O ( n + ν√ n ) ln ǫ− 1. D Convergence analysis under optimally strong convexity In order to prove Theorems 7 and 8, we first state Lemma 3 in [33] in our notations. Lemma 5 (Lemma 3 in [33...

  55. [55]

    According to the selection of τt, St, Bt and η, we know that the coefficient ( 1 2η − LF 2 − τt 9G0η 2St ) ≥ 0

    + 2µ∥xt 0 − x∗∥2] −( 1 2η− LF 2 − τt 9G0η 2St ) τt∑ r=1 E[∥xt r − xt r− 1∥2] + τt 9σ2 0 η 2Bt . According to the selection of τt, St, Bt and η, we know that the coefficient ( 1 2η − LF 2 − τt 9G0η 2St ) ≥ 0. Consequently, 2µη τt − 1∑ i=0 E[Φ( xt i+1) − Φ∗] ≤ E[Φ( xt τt ) + 2µ∥xt τt − x∗∥2] − E[Φ( xt

  56. [56]

    Summing this up and apply the random selection rule of ¯ x gives E[Φ( ¯x) − Φ∗] ≤ 1 2µητT E[Φ( x1 0 ) − Φ∗ + 2µ∥x1 0 − x∗∥2] + 9σ2 0 4µBt ≤ 5 2µητT E[Φ( x1 0 ) − Φ∗] + 9σ2 0 4µBt

    + 2µ∥xt 0 − x∗∥2] + τt 9σ2 0 η 2Bt . Summing this up and apply the random selection rule of ¯ x gives E[Φ( ¯x) − Φ∗] ≤ 1 2µητT E[Φ( x1 0 ) − Φ∗ + 2µ∥x1 0 − x∗∥2] + 9σ2 0 4µBt ≤ 5 2µητT E[Φ( x1 0 ) − Φ∗] + 9σ2 0 4µBt . If we choose T = ⌈5√ ǫ µη ⌉, τ= S = 1√ ǫ and Bt = 1 + 9σ 2 0 2µǫ , then 5 2µητ T ≤ 1 2 and we obtain E[Φ( ¯x) − Φ∗] ≤ 1 2 E[Φ( x1

  57. [57]

    This proves the inequality (31)

    − Φ∗] + 1 2 ǫ . This proves the inequality (31). The rest of the proof will mi mic that of Theorem 5. Discussions on sample complexity: • If we choose τ= S = 1/ √ ǫ, Bt = 1 + 9σ 2 0 2µǫ , and T = ⌈5√ ǫ µη ⌉, then the sample complexity is T ( B + 2τS) ln 1 ǫ= 5√ ǫ µη ( 9σ2 0 2µǫ+ 1√ ǫ 1√ ǫ ) ln 1 ǫ= O ( ( µ− 2σ2 0 ǫ− 1/ 2 + µ− 1ǫ− 1/ 2) ln ǫ− 1 ) . The abo...

  58. [58]

    D.2 Proof of Theorem 8 The proof is very similar to the previous one

    − F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 5 µη ( 9σ2 0 µǫ+ 2 ) ln 1 ǫ= O ( µ− 2σ2 0 ǫ− 1 + µ− 1 ) ln ǫ− 1 Defining the condition number κ= LF ν= O( 1/( µη)) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. D.2 Proof of Theorem...