A Stochastic Composite Gradient Method with Incremental Variance Reduction

Junyu Zhang; Lin Xiao

arxiv: 1906.10186 · v1 · pith:EWB5ZLMVnew · submitted 2019-06-24 · 🧮 math.OC · stat.ML

A Stochastic Composite Gradient Method with Incremental Variance Reduction

Junyu Zhang , Lin Xiao This is my paper

Pith reviewed 2026-05-25 17:00 UTC · model grok-4.3

classification 🧮 math.OC stat.ML

keywords stochastic optimizationvariance reductioncomposite functionsnonconvex optimizationfinite-sum problemsgradient methodsexpectation minimizationmachine learning

0 comments

The pith

A stochastic composite gradient method with incremental variance reduction matches the best known complexity for nonconvex finite-sum and expectation problems despite biased estimators from composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to minimize the composition of a smooth nonconvex outer function with a smooth inner vector mapping expressed as an expectation or finite sum. It introduces an incremental variance-reduced estimator that tracks both the inner mapping and its Jacobian. The central result is that this approach attains the same complexity orders as leading first-order methods for ordinary nonconvex stochastic problems. The result holds even though the outer composition introduces bias into the gradient estimator. This extends the reach of low-complexity variance reduction techniques to a larger set of machine learning objectives.

Core claim

We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased.

What carries the argument

Incremental variance-reduced estimator applied simultaneously to the inner vector mapping and its Jacobian, which controls bias introduced by the outer composition.

If this is right

The same complexity orders apply to both expectation-based and finite-sum formulations of the composite problem.
A wider class of machine learning models involving composition can now use incremental variance reduction without losing optimal rates.
The method preserves the low per-iteration cost typical of incremental variance reduction while handling the extra Jacobian estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The estimator construction may extend to other optimization settings where gradient estimates are biased by an outer nonlinear mapping.
Practical implementations could be tested on composite losses arising in structured prediction or generative models to measure wall-clock gains.
The analysis framework might adapt to related problems such as bilevel optimization or hyperparameter tuning that also feature composed expectations.

Load-bearing premise

The inner mapping and outer function are both smooth and the variance reduction estimator satisfies standard incremental update properties used in SVRG and SAGA analyses.

What would settle it

A concrete nonconvex composite objective where the proposed method requires more than the stated number of iterations or gradient evaluations to reach a given accuracy level.

Figures

Figures reproduced from arXiv: 1906.10186 by Junyu Zhang, Lin Xiao.

**Figure 2.** Figure 2: Experiments on policy evaluation for MDP for cases [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗

read the original abstract

We consider the problem of minimizing the composition of a smooth (nonconvex) function and a smooth vector mapping, where the inner mapping is in the form of an expectation over some random variable or a finite sum. We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased. This finding enables a much broader range of applications in machine learning to benefit from the low complexity of incremental variance-reduction methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This extends incremental variance reduction to composite nonconvex stochastic problems while keeping the same complexity order as the best non-composite methods.

read the letter

The main point is a stochastic gradient method for min f(g(x)) where g is an expectation or finite sum. It applies incremental variance-reduced estimators to both g and its Jacobian at once, and claims this controls the bias from the outer composition so the leading complexity terms stay unchanged from standard nonconvex VR results like SVRG or SAGA. That is the actual new piece: the simultaneous reduction on map and Jacobian for the composite case, which prior VR work did not directly cover. The abstract states the rates match the known bounds for expected-value and finite-sum nonconvex minimization, which is the useful claim for machine learning settings that involve compositions. The setup uses only standard smoothness and incremental update assumptions, with no circularity or invented parameters beyond the usual step size. The bias-handling step is the load-bearing part of the analysis, and while the stress-test note indicates it works under those assumptions without hidden extra factors, the full proof would still need checking for any overlooked logarithmic or constant terms. No experiments are described in the abstract, so practical behavior is not visible here. This is for people already working on first-order stochastic methods for nonconvex finite-sum problems. A reader who needs VR rates on composite objectives would get the algorithmic idea and the complexity statement. The work shows clear engagement with the literature and a sharp claim, so it deserves referee time even if revisions are needed on the proof details. I would send it to peer review.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a stochastic composite gradient method with incremental variance-reduced estimators applied to both the inner vector mapping and its Jacobian. The central claim is that the resulting algorithm attains the same iteration complexity orders as the best-known first-order variance-reduced methods for nonconvex expected-value and finite-sum problems, even though the outer composition renders the gradient estimator biased.

Significance. If the complexity result is correct, the work is significant because it shows that the bias induced by the composition can be controlled under standard smoothness assumptions without inflating the leading terms, thereby extending the reach of low-complexity incremental VR methods to a wider class of composite problems that arise in machine learning.

minor comments (3)

The abstract would be strengthened by explicitly stating the concrete complexity orders achieved (e.g., O(1/ε²) or O(n + √n/ε) depending on the setting) rather than only claiming they match prior work.
§2: the notation distinguishing the two incremental estimators (one for the mapping, one for the Jacobian) should be introduced earlier and with an explicit comparison to the standard SVRG/SAGA recursions to improve readability.
The experimental section would benefit from additional baselines that isolate the effect of the Jacobian variance reduction versus using only a mapping estimator.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments under the MAJOR COMMENTS section.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives complexity bounds for a stochastic composite gradient method using incremental variance-reduced estimators for both the inner mapping and its Jacobian. The central claim—that the method matches the best-known orders for nonconvex expected-value and finite-sum problems despite estimator bias—is established by controlling the bias term under standard smoothness assumptions on the inner and outer functions together with the incremental update properties of variance reduction (as in SVRG/SAGA). These assumptions are invoked directly in the analysis rather than being fitted to data or defined in terms of the target result; no self-citation chain, ansatz smuggling, or renaming of known empirical patterns is required for the leading terms. The derivation therefore remains self-contained against external benchmarks from the variance-reduction literature.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The paper relies on standard smoothness assumptions for the outer function and inner mapping, plus properties of incremental variance reduction estimators. No new entities are invented. Step-size parameters are chosen based on problem constants but are not data-fitted in the usual sense.

free parameters (1)

step size
Chosen based on smoothness constants and variance bounds to achieve the stated complexity; typical in convergence analyses.

axioms (2)

domain assumption Outer function and inner mapping are smooth (Lipschitz gradients).
Standard assumption for first-order methods; invoked to bound the bias and variance in the composite estimator.
domain assumption Incremental variance-reduced estimator satisfies the standard unbiasedness and variance reduction properties.
Relies on properties from prior SVRG/SAGA literature.

pith-pipeline@v0.9.0 · 5628 in / 1290 out tokens · 19419 ms · 2026-05-25T17:00:15.376090+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian... achieves the same orders of complexity as the best known first-order methods... despite the additional outer composition which renders the composite gradient estimator biased.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Assumption 1... f and g_ξ Lipschitz and smooth... LF = ℓ_g² Lf + ℓf Lg

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 4 internal anchors

[1]

Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter

Zeyuan Allen-Zhu. Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machin e Learning (ICML), volume 70 of Proceedings of Machine Learning Research , pages 89–97, Sydney, Australia, 2017

work page 2017
[2]

Natasha 2: Faster non-convex optimiz ation than SGD

Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimiz ation than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018

work page 2018
[3]

Variance reduction for faster non-convex optimization

Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machin e Learning (ICML) , pages 699–707, 2016

work page 2016
[4]

First-Order Methods in Optimization

Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017

work page 2017
[5]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Y a nn LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchin a. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Policy evaluation with temporal diﬀerences: a survey and comparison

Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal diﬀerences: a survey and comparison. Journal of Machine Learning Research , 15(1):809–883, 2014

work page 2014
[7]

SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014

work page 2014
[8]

Spider: Near-optimal non- convex optimization via stochastic path-integrated diﬀer ential estimator

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non- convex optimization via stochastic path-integrated diﬀer ential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018

work page 2018
[9]

Stochastic ﬁrst- and zer oth-order methods for nonconvex stochastic programming

Saeed Ghadimi and Guanghui Lan. Stochastic ﬁrst- and zer oth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341–2368, 2013

work page 2013
[10]

Mollifying Networks

Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Y oshua Bengio. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

O n graduated optimization for stochastic non-convex problems

Elad Hazan, Kﬁr Y ehuda Levy, and Shai Shalev-Shwartz. O n graduated optimization for stochastic non-convex problems. In International conference on machine learning , pages 1833–1841, 2016

work page 2016
[12]

Accelerate d method for stochastic composition optimization with nonsmooth regularization

Zhouyuan Huo, Bin Gu, Ji Jiu, and Heng Huang. Accelerate d method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artiﬁcial Intelligence, pages 3287–3294, 2018

work page 2018
[13]

Accelerating stochastic gradient descent using predictive variance reduction

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26 , pages 315–323, 2013

work page 2013
[14]

Linear co nvergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz c ondition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear co nvergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz c ondition. In Machine Learning and Knowledge Discovery in Database - European Conference, Proceedings, pages 795–811, 2016

work page 2016
[15]

N on-convex ﬁnite-sum optimization via SCSG methods

Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. N on-convex ﬁnite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30 , pages 2348–

work page
[16]

Curran Associates, Inc., 2017

work page 2017
[17]

A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization

Zhize Li and Jian Li. A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 5564–5574. Curran Associates, Inc., 2018

work page 2018
[18]

Finite-sum compo sition optimization via variance reduced gradient descent

Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum compo sition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artiﬁc ial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017. 9

work page 2017
[19]

On the link between gau ssian homotopy continuation and convex envelopes

Hossein Mobahi and John W Fisher. On the link between gau ssian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in C omputer Vision and Pattern Recognition, pages 43–56. Springer, 2015

work page 2015
[20]

Gradient methods for minimizing compo site functions

Yurii Nesterov. Gradient methods for minimizing compo site functions. Mathematical Pro- gramming, 140(1):125–161, 2013

work page 2013
[21]

Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč

Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Y ee Whye Teh, editors, Proceedings of the 34th International Conference on Machin e Learning (ICML) , volume 70 of Proceedings of Machine Learning Research (PMLR) , pages 2613–2621, Sydn...

work page 2017
[22]

Finite-Sum Smooth Optimization with SARAH

Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha N guyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization w ith sarah. arXiv preprint, arXiv:1901.07648, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[23]

ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization

Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran- Dinh. ProxSARAH: An eﬃ- cient algorithmic framework for stochastic composite nonc onvex optimization. arXiv preprint, arXiv:1902.05679, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[24]

Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning , volume 48 of Proceedings of Machine Learning Research , pages 314–323, New Y ork, New Y ork, USA, 2016

work page 2016
[25]

Fast incremental method for smooth nonconvex optimization

Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex S mola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016

work page 2016
[26]

Proximal stochastic methods for nonsmooth nonconvex ﬁnite-sum optimization

Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexa nder J Smola. Proximal stochastic methods for nonsmooth nonconvex ﬁnite-sum optimization. I n Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016

work page 2016
[27]

Tyrrell Rockafellar

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970

work page 1970
[28]

Tyrrell Rockafellar

R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. IN- FORMS TutORials in Operations Research, 2007

work page 2007
[29]

Advances in risk-averse optimiz ation

Andrzej Ruszczyński. Advances in risk-averse optimiz ation. INFORMS TutORials in Operation Research, 2013

work page 2013
[30]

Reinforcement Learning: An Introduction

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, 1998

work page 1998
[31]

Stochastic compo sitional gradient descent: algo- rithms for minimizing compositions of expected-value func tions

Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compo sitional gradient descent: algo- rithms for minimizing compositions of expected-value func tions. Mathematical Programming, 161(1-2):419–449, 2017

work page 2017
[32]

Accelerating stoch astic composition optimization

Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stoch astic composition optimization. Journal of Machine Learning Research , 18(105):1–23, 2017

work page 2017
[33]

SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization

Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tar okh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018

work page arXiv 2018
[34]

A proximal stochastic gradient method with progressive variance reduction

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization , 24(4):2057–2075, 2014

work page 2057
[35]

A composite randomized increm ental gradient method

Junyu Zhang and Lin Xiao. A composite randomized increm ental gradient method. In Pro- ceedings of the 36th International Conference on Machine Le arning (ICML) , number 97 in Proceedings of Machine Learning Research (PMLR), Long Beac h, California, 2019

work page 2019
[36]

Stochastic neste d variance reduced gradient descent for nonconvex optimization

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic neste d variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 3921–3932. Curran Associates, Inc., 2018. 10 Appendices A Convergence analysis for composite expectation case In this section, we focus on convergence analysis of CIVR for s...

work page 2018
[37]

By the random sampling scheme for output ¯ x, we have E[∥G( ¯x)∥ 2] = 1 τT T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

− Φ∗ + 3σ2 0 η 4B τT, where we have applied the fact that xt 0 = xt− 1 τ . By the random sampling scheme for output ¯ x, we have E[∥G( ¯x)∥ 2] = 1 τT T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

work page
[38]

(45) Substitute the values of T, τand B gives (22)

− Φ∗) τT η + 6σ2 0 B . (45) Substitute the values of T, τand B gives (22). To simplify presentation, we omit ⌈·⌉ on integer parameters in the following discussion. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ √ ǫ, B = σ2 0 / ǫ, and τ= S = 1/ √ ǫ, we have E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

work page
[39]

• With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ ǫ, B = 1 + σ2 0 / ǫ, and τ= S = 1, we again obtain E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

− Φ∗) η− 1 + 6) ǫ, and the sample complexity is T ( B + 2τS) = O (σ2 0 ǫ− 3/ 2 + ǫ− 3/ 2) , as in our theorem. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ ǫ, B = 1 + σ2 0 / ǫ, and τ= S = 1, we again obtain E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

work page
[40]

For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity

− Φ∗) η− 1 + 6) ǫ, but the sample complexity isT ( B+2τS) = O (σ2 0 ǫ− 2 +ǫ− 1) , which is same as in Ghadimi and Lan [9]. For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity. 14 A.2 Proof of Theorem 2 Proof. Note that for this set of parameters, we still have the relati onship that τt = St . Therefore, within each epoch, (44...

work page
[41]

(46) By the random selection rule of ¯ x, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

− Φ∗ + T∑ t=1 3σ2 0 η 4Bt τt . (46) By the random selection rule of ¯ x, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

work page
[42]

(47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉

− Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T t=1 τt / Bt ∑T t=1 τt . (47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉. We have T∑ t=1 τt ≥ T∑ t=1 at + b = a 2 T ( T + 1) + bT = O( T 2) and σ2 0 T∑ t=1 τt / Bt ≤ T∑ t=1 1 at + b ≤ 1 a + b + /uni222B.dsp T 1 dt at + b = 1 a + b + 1 a ln ( aT + b a + b ) = O( ln T ) . Substituting the above bounds into inequality (4...

work page
[43]

This is more close to the classical results on stochastic opt imization

− Φ∗) ηT + 6 lnT T , but the sample complexity, by setting T = ˜O( ǫ− 1) so that the above bound is less than ǫ, is T∑ t=1 (Bt + 2τt St ) ≤ T∑ t=1 ( σ2 0 ( at + b) + 2 ) = O( σ2 0 T 2 + T ) = ˜O( σ2 0 ǫ− 2 + ǫ− 1) . This is more close to the classical results on stochastic opt imization. B Convergence analysis for composite ﬁnite-sum case In this section,...

work page
[44]

Namely, in (11) of Algorithm 1, we choose Bt = { 1,

by their exact value rather than the approximate ones constructed by subsa mpling. Namely, in (11) of Algorithm 1, we choose Bt = { 1, . . . , n} for all t ≥ 1. Therefore, yt 0 = g( xt

work page
[45]

= 1 n n∑ j=1 gj ( xt

work page
[46]

= 1 n n∑ j=1 g′ j ( xt 0) and E [ ∥yt 0 − g( xt 0)∥ 2] = 0 , E [ ∥zt 0 − g′( xk 0 )∥ 2] = 0 . (48) 15 As a result, the initial variances in Lemma 1 diminishes and ( 33) reduces to              E[∥yt i − g( xt i )∥ 2] ≤ i∑ r=1 ℓ2 g St E[∥xt r − xt r− 1 ∥2] , E[∥zt i − g′( xt i )∥ 2] ≤ i∑ r=1 L2 g St E[∥xt r − xt r− 1∥2] . (49) In addition, com...

work page
[47]

− Φ∗) τT η = 8( Φ( x1

work page
[48]

Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution

− Φ∗) √ nTη . Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution. Note that the sample complexity per epoch is n + τt St = 2n, the total sample complexity will be O( n + √ nǫ− 1) . B.2 Proof of Theorem 4 Proof. If T ≤ T0, then the result is exactly what we proved from Theorem 2. The refore, the ﬁrst bound in (26) is already guaranteed. If T > T...

work page
[49]

(51) When T0 + 1 ≤ t ≤ T , the following bound becomes eﬀective, η 8 T∑ t=T0+1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ E[Φ( xT0+1 0 )] − Φ∗

− E[Φ( xT0+1 0 )] + T0∑ t=1 3σ2 0 η 4Bt τt . (51) When T0 + 1 ≤ t ≤ T , the following bound becomes eﬀective, η 8 T∑ t=T0+1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ E[Φ( xT0+1 0 )] − Φ∗. Therefore, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

work page
[50]

− Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T0 t=1 τt / Bt ∑T t=1 τt . Note that T0∑ t=1 τt / Bt ≤ T0∑ t=1 1 at + b ≤ 1 a + b + 1 a ln ( aT0 + b a + b ) = O( ln n) , and T∑ t=1 τt ≥ ( T − T0) √ n + T0∑ t=1 ( at + b) = √ n( T − T0) + a 2 T 2 0 + ( a 2 + b) T0 = O( √ n( T − T0 + 1)) . 16 With the above two bounds, we have proved the second result in (26). For any ǫ >0, if ...

work page
[51]

− F ∗) τT η + 6νσ2 0 B By the selection of T = ⌈ 16ν√ ǫ η ⌉ , τ= 1/ √ ǫand B = 1 + 12νσ2 0 / ǫ, we have E[F( ¯x) − F ∗] ≤ 1 2 ( F( x1

work page
[52]

Suppose we periodically restart the Algorithm 1 after every T epochs, and set the outputs to be ¯ xk , where k = 1, 2,

− F ∗) + 1 2 ǫ, (52) which is (28). Suppose we periodically restart the Algorithm 1 after every T epochs, and set the outputs to be ¯ xk , where k = 1, 2, ... denotes the number of restarts. We use the output of the kth period ¯xk as the initial point to start the next period, which produces ¯ xk+1. As a result, the above inequality translates to E[F( ¯xk...

work page
[53]

C.2 Proof of Theorem 6 The proof is very similar to the previous one

− F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 16ν η ( 12νσ2 0 ǫ + 2 ) ln 1 ǫ= O ( ν2σ2 0 ǫ− 1/ 2 + ν ) ln ǫ− 1 Deﬁning the condition number κ= LF ν= O( ν/ η) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. C.2 Proof of Theorem 6...

work page
[54]

By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n

− F ∗) τT η . By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n. we again obtain (52). In this case, we have B = n and T ( B + 2τS) ǫ− 1 = ⌈ 16ν η√ n ⌉ ( n + 2√ n√ n ) ln ǫ− 1 = O ( n + ν√ n ) ln ǫ− 1. D Convergence analysis under optimally strong convexity In order to prove Theorems 7 and 8, we ﬁrst state Lemma 3 in [33] in our notations. Lemma 5 (Lemma 3 in [33...

work page
[55]

According to the selection of τt, St, Bt and η, we know that the coeﬃcient ( 1 2η − LF 2 − τt 9G0η 2St ) ≥ 0

+ 2µ∥xt 0 − x∗∥2] −( 1 2η− LF 2 − τt 9G0η 2St ) τt∑ r=1 E[∥xt r − xt r− 1∥2] + τt 9σ2 0 η 2Bt . According to the selection of τt, St, Bt and η, we know that the coeﬃcient ( 1 2η − LF 2 − τt 9G0η 2St ) ≥ 0. Consequently, 2µη τt − 1∑ i=0 E[Φ( xt i+1) − Φ∗] ≤ E[Φ( xt τt ) + 2µ∥xt τt − x∗∥2] − E[Φ( xt

work page
[56]

Summing this up and apply the random selection rule of ¯ x gives E[Φ( ¯x) − Φ∗] ≤ 1 2µητT E[Φ( x1 0 ) − Φ∗ + 2µ∥x1 0 − x∗∥2] + 9σ2 0 4µBt ≤ 5 2µητT E[Φ( x1 0 ) − Φ∗] + 9σ2 0 4µBt

+ 2µ∥xt 0 − x∗∥2] + τt 9σ2 0 η 2Bt . Summing this up and apply the random selection rule of ¯ x gives E[Φ( ¯x) − Φ∗] ≤ 1 2µητT E[Φ( x1 0 ) − Φ∗ + 2µ∥x1 0 − x∗∥2] + 9σ2 0 4µBt ≤ 5 2µητT E[Φ( x1 0 ) − Φ∗] + 9σ2 0 4µBt . If we choose T = ⌈5√ ǫ µη ⌉, τ= S = 1√ ǫ and Bt = 1 + 9σ 2 0 2µǫ , then 5 2µητ T ≤ 1 2 and we obtain E[Φ( ¯x) − Φ∗] ≤ 1 2 E[Φ( x1

work page
[57]

This proves the inequality (31)

− Φ∗] + 1 2 ǫ . This proves the inequality (31). The rest of the proof will mi mic that of Theorem 5. Discussions on sample complexity: • If we choose τ= S = 1/ √ ǫ, Bt = 1 + 9σ 2 0 2µǫ , and T = ⌈5√ ǫ µη ⌉, then the sample complexity is T ( B + 2τS) ln 1 ǫ= 5√ ǫ µη ( 9σ2 0 2µǫ+ 1√ ǫ 1√ ǫ ) ln 1 ǫ= O ( ( µ− 2σ2 0 ǫ− 1/ 2 + µ− 1ǫ− 1/ 2) ln ǫ− 1 ) . The abo...

work page
[58]

D.2 Proof of Theorem 8 The proof is very similar to the previous one

− F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 5 µη ( 9σ2 0 µǫ+ 2 ) ln 1 ǫ= O ( µ− 2σ2 0 ǫ− 1 + µ− 1 ) ln ǫ− 1 Deﬁning the condition number κ= LF ν= O( 1/( µη)) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. D.2 Proof of Theorem...

work page 2000

[1] [1]

Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter

Zeyuan Allen-Zhu. Natasha: Faster non-convex stochast ic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machin e Learning (ICML), volume 70 of Proceedings of Machine Learning Research , pages 89–97, Sydney, Australia, 2017

work page 2017

[2] [2]

Natasha 2: Faster non-convex optimiz ation than SGD

Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimiz ation than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018

work page 2018

[3] [3]

Variance reduction for faster non-convex optimization

Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machin e Learning (ICML) , pages 699–707, 2016

work page 2016

[4] [4]

First-Order Methods in Optimization

Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017

work page 2017

[5] [5]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Y a nn LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchin a. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[6] [6]

Policy evaluation with temporal diﬀerences: a survey and comparison

Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal diﬀerences: a survey and comparison. Journal of Machine Learning Research , 15(1):809–883, 2014

work page 2014

[7] [7]

SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite obje ctives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014

work page 2014

[8] [8]

Spider: Near-optimal non- convex optimization via stochastic path-integrated diﬀer ential estimator

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non- convex optimization via stochastic path-integrated diﬀer ential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018

work page 2018

[9] [9]

Stochastic ﬁrst- and zer oth-order methods for nonconvex stochastic programming

Saeed Ghadimi and Guanghui Lan. Stochastic ﬁrst- and zer oth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341–2368, 2013

work page 2013

[10] [10]

Mollifying Networks

Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Y oshua Bengio. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

O n graduated optimization for stochastic non-convex problems

Elad Hazan, Kﬁr Y ehuda Levy, and Shai Shalev-Shwartz. O n graduated optimization for stochastic non-convex problems. In International conference on machine learning , pages 1833–1841, 2016

work page 2016

[12] [12]

Accelerate d method for stochastic composition optimization with nonsmooth regularization

Zhouyuan Huo, Bin Gu, Ji Jiu, and Heng Huang. Accelerate d method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artiﬁcial Intelligence, pages 3287–3294, 2018

work page 2018

[13] [13]

Accelerating stochastic gradient descent using predictive variance reduction

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26 , pages 315–323, 2013

work page 2013

[14] [14]

Linear co nvergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz c ondition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear co nvergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz c ondition. In Machine Learning and Knowledge Discovery in Database - European Conference, Proceedings, pages 795–811, 2016

work page 2016

[15] [15]

N on-convex ﬁnite-sum optimization via SCSG methods

Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. N on-convex ﬁnite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30 , pages 2348–

work page

[16] [16]

Curran Associates, Inc., 2017

work page 2017

[17] [17]

A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization

Zhize Li and Jian Li. A simple proximal stochastic gradi ent method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 5564–5574. Curran Associates, Inc., 2018

work page 2018

[18] [18]

Finite-sum compo sition optimization via variance reduced gradient descent

Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum compo sition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artiﬁc ial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017. 9

work page 2017

[19] [19]

On the link between gau ssian homotopy continuation and convex envelopes

Hossein Mobahi and John W Fisher. On the link between gau ssian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in C omputer Vision and Pattern Recognition, pages 43–56. Springer, 2015

work page 2015

[20] [20]

Gradient methods for minimizing compo site functions

Yurii Nesterov. Gradient methods for minimizing compo site functions. Mathematical Pro- gramming, 140(1):125–161, 2013

work page 2013

[21] [21]

Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč

Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak áč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Y ee Whye Teh, editors, Proceedings of the 34th International Conference on Machin e Learning (ICML) , volume 70 of Proceedings of Machine Learning Research (PMLR) , pages 2613–2621, Sydn...

work page 2017

[22] [22]

Finite-Sum Smooth Optimization with SARAH

Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha N guyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization w ith sarah. arXiv preprint, arXiv:1901.07648, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[23] [23]

ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization

Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran- Dinh. ProxSARAH: An eﬃ- cient algorithmic framework for stochastic composite nonc onvex optimization. arXiv preprint, arXiv:1902.05679, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[24] [24]

Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poc zos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning , volume 48 of Proceedings of Machine Learning Research , pages 314–323, New Y ork, New Y ork, USA, 2016

work page 2016

[25] [25]

Fast incremental method for smooth nonconvex optimization

Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex S mola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016

work page 2016

[26] [26]

Proximal stochastic methods for nonsmooth nonconvex ﬁnite-sum optimization

Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexa nder J Smola. Proximal stochastic methods for nonsmooth nonconvex ﬁnite-sum optimization. I n Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016

work page 2016

[27] [27]

Tyrrell Rockafellar

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970

work page 1970

[28] [28]

Tyrrell Rockafellar

R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. IN- FORMS TutORials in Operations Research, 2007

work page 2007

[29] [29]

Advances in risk-averse optimiz ation

Andrzej Ruszczyński. Advances in risk-averse optimiz ation. INFORMS TutORials in Operation Research, 2013

work page 2013

[30] [30]

Reinforcement Learning: An Introduction

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, 1998

work page 1998

[31] [31]

Stochastic compo sitional gradient descent: algo- rithms for minimizing compositions of expected-value func tions

Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compo sitional gradient descent: algo- rithms for minimizing compositions of expected-value func tions. Mathematical Programming, 161(1-2):419–449, 2017

work page 2017

[32] [32]

Accelerating stoch astic composition optimization

Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stoch astic composition optimization. Journal of Machine Learning Research , 18(105):1–23, 2017

work page 2017

[33] [33]

SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization

Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tar okh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018

work page arXiv 2018

[34] [34]

A proximal stochastic gradient method with progressive variance reduction

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization , 24(4):2057–2075, 2014

work page 2057

[35] [35]

A composite randomized increm ental gradient method

Junyu Zhang and Lin Xiao. A composite randomized increm ental gradient method. In Pro- ceedings of the 36th International Conference on Machine Le arning (ICML) , number 97 in Proceedings of Machine Learning Research (PMLR), Long Beac h, California, 2019

work page 2019

[36] [36]

Stochastic neste d variance reduced gradient descent for nonconvex optimization

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic neste d variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31 , pages 3921–3932. Curran Associates, Inc., 2018. 10 Appendices A Convergence analysis for composite expectation case In this section, we focus on convergence analysis of CIVR for s...

work page 2018

[37] [37]

By the random sampling scheme for output ¯ x, we have E[∥G( ¯x)∥ 2] = 1 τT T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

− Φ∗ + 3σ2 0 η 4B τT, where we have applied the fact that xt 0 = xt− 1 τ . By the random sampling scheme for output ¯ x, we have E[∥G( ¯x)∥ 2] = 1 τT T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

work page

[38] [38]

(45) Substitute the values of T, τand B gives (22)

− Φ∗) τT η + 6σ2 0 B . (45) Substitute the values of T, τand B gives (22). To simplify presentation, we omit ⌈·⌉ on integer parameters in the following discussion. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ √ ǫ, B = σ2 0 / ǫ, and τ= S = 1/ √ ǫ, we have E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

work page

[39] [39]

• With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ ǫ, B = 1 + σ2 0 / ǫ, and τ= S = 1, we again obtain E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

− Φ∗) η− 1 + 6) ǫ, and the sample complexity is T ( B + 2τS) = O (σ2 0 ǫ− 3/ 2 + ǫ− 3/ 2) , as in our theorem. • With η≤ 4 LF + √ L2 F +12G0 , and letting T = 1/ ǫ, B = 1 + σ2 0 / ǫ, and τ= S = 1, we again obtain E[∥G( ¯x)∥ 2] ≤ 8(( Φ( x1

work page

[40] [40]

For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity

− Φ∗) η− 1 + 6) ǫ, but the sample complexity isT ( B+2τS) = O (σ2 0 ǫ− 2 +ǫ− 1) , which is same as in Ghadimi and Lan [9]. For deterministic optimization with σ0 = 0, this recovers the O( ǫ− 1) complexity. 14 A.2 Proof of Theorem 2 Proof. Note that for this set of parameters, we still have the relati onship that τt = St . Therefore, within each epoch, (44...

work page

[41] [41]

(46) By the random selection rule of ¯ x, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

− Φ∗ + T∑ t=1 3σ2 0 η 4Bt τt . (46) By the random selection rule of ¯ x, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

work page

[42] [42]

(47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉

− Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T t=1 τt / Bt ∑T t=1 τt . (47) Note that τt = ⌈at + b⌉ and Bt = ⌈σ2 0 ( at + b) 2⌉. We have T∑ t=1 τt ≥ T∑ t=1 at + b = a 2 T ( T + 1) + bT = O( T 2) and σ2 0 T∑ t=1 τt / Bt ≤ T∑ t=1 1 at + b ≤ 1 a + b + /uni222B.dsp T 1 dt at + b = 1 a + b + 1 a ln ( aT + b a + b ) = O( ln T ) . Substituting the above bounds into inequality (4...

work page

[43] [43]

This is more close to the classical results on stochastic opt imization

− Φ∗) ηT + 6 lnT T , but the sample complexity, by setting T = ˜O( ǫ− 1) so that the above bound is less than ǫ, is T∑ t=1 (Bt + 2τt St ) ≤ T∑ t=1 ( σ2 0 ( at + b) + 2 ) = O( σ2 0 T 2 + T ) = ˜O( σ2 0 ǫ− 2 + ǫ− 1) . This is more close to the classical results on stochastic opt imization. B Convergence analysis for composite ﬁnite-sum case In this section,...

work page

[44] [44]

Namely, in (11) of Algorithm 1, we choose Bt = { 1,

by their exact value rather than the approximate ones constructed by subsa mpling. Namely, in (11) of Algorithm 1, we choose Bt = { 1, . . . , n} for all t ≥ 1. Therefore, yt 0 = g( xt

work page

[45] [45]

= 1 n n∑ j=1 gj ( xt

work page

[46] [46]

= 1 n n∑ j=1 g′ j ( xt 0) and E [ ∥yt 0 − g( xt 0)∥ 2] = 0 , E [ ∥zt 0 − g′( xk 0 )∥ 2] = 0 . (48) 15 As a result, the initial variances in Lemma 1 diminishes and ( 33) reduces to              E[∥yt i − g( xt i )∥ 2] ≤ i∑ r=1 ℓ2 g St E[∥xt r − xt r− 1 ∥2] , E[∥zt i − g′( xt i )∥ 2] ≤ i∑ r=1 L2 g St E[∥xt r − xt r− 1∥2] . (49) In addition, com...

work page

[47] [47]

− Φ∗) τT η = 8( Φ( x1

work page

[48] [48]

Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution

− Φ∗) √ nTη . Therefore, we have to set T = O( 1√ nǫ ) to get an ǫ-solution. Note that the sample complexity per epoch is n + τt St = 2n, the total sample complexity will be O( n + √ nǫ− 1) . B.2 Proof of Theorem 4 Proof. If T ≤ T0, then the result is exactly what we proved from Theorem 2. The refore, the ﬁrst bound in (26) is already guaranteed. If T > T...

work page

[49] [49]

(51) When T0 + 1 ≤ t ≤ T , the following bound becomes eﬀective, η 8 T∑ t=T0+1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ E[Φ( xT0+1 0 )] − Φ∗

− E[Φ( xT0+1 0 )] + T0∑ t=1 3σ2 0 η 4Bt τt . (51) When T0 + 1 ≤ t ≤ T , the following bound becomes eﬀective, η 8 T∑ t=T0+1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ E[Φ( xT0+1 0 )] − Φ∗. Therefore, we have E[∥G( ¯x)∥ 2] = 1 ∑T t=1 τt T∑ t=1 τ− 1∑ i=0 E[∥G( xt i )∥ 2] ≤ 8( Φ( x1

work page

[50] [50]

− Φ∗) η∑T t=1 τt + 6σ2 0 · ∑T0 t=1 τt / Bt ∑T t=1 τt . Note that T0∑ t=1 τt / Bt ≤ T0∑ t=1 1 at + b ≤ 1 a + b + 1 a ln ( aT0 + b a + b ) = O( ln n) , and T∑ t=1 τt ≥ ( T − T0) √ n + T0∑ t=1 ( at + b) = √ n( T − T0) + a 2 T 2 0 + ( a 2 + b) T0 = O( √ n( T − T0 + 1)) . 16 With the above two bounds, we have proved the second result in (26). For any ǫ >0, if ...

work page

[51] [51]

− F ∗) τT η + 6νσ2 0 B By the selection of T = ⌈ 16ν√ ǫ η ⌉ , τ= 1/ √ ǫand B = 1 + 12νσ2 0 / ǫ, we have E[F( ¯x) − F ∗] ≤ 1 2 ( F( x1

work page

[52] [52]

Suppose we periodically restart the Algorithm 1 after every T epochs, and set the outputs to be ¯ xk , where k = 1, 2,

− F ∗) + 1 2 ǫ, (52) which is (28). Suppose we periodically restart the Algorithm 1 after every T epochs, and set the outputs to be ¯ xk , where k = 1, 2, ... denotes the number of restarts. We use the output of the kth period ¯xk as the initial point to start the next period, which produces ¯ xk+1. As a result, the above inequality translates to E[F( ¯xk...

work page

[53] [53]

C.2 Proof of Theorem 6 The proof is very similar to the previous one

− F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 16ν η ( 12νσ2 0 ǫ + 2 ) ln 1 ǫ= O ( ν2σ2 0 ǫ− 1/ 2 + ν ) ln ǫ− 1 Deﬁning the condition number κ= LF ν= O( ν/ η) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. C.2 Proof of Theorem 6...

work page

[54] [54]

By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n

− F ∗) τT η . By choosing T = ⌈ 16ν η √ n ⌉, τ= S = √ n. we again obtain (52). In this case, we have B = n and T ( B + 2τS) ǫ− 1 = ⌈ 16ν η√ n ⌉ ( n + 2√ n√ n ) ln ǫ− 1 = O ( n + ν√ n ) ln ǫ− 1. D Convergence analysis under optimally strong convexity In order to prove Theorems 7 and 8, we ﬁrst state Lemma 3 in [33] in our notations. Lemma 5 (Lemma 3 in [33...

work page

[55] [55]

According to the selection of τt, St, Bt and η, we know that the coeﬃcient ( 1 2η − LF 2 − τt 9G0η 2St ) ≥ 0

+ 2µ∥xt 0 − x∗∥2] −( 1 2η− LF 2 − τt 9G0η 2St ) τt∑ r=1 E[∥xt r − xt r− 1∥2] + τt 9σ2 0 η 2Bt . According to the selection of τt, St, Bt and η, we know that the coeﬃcient ( 1 2η − LF 2 − τt 9G0η 2St ) ≥ 0. Consequently, 2µη τt − 1∑ i=0 E[Φ( xt i+1) − Φ∗] ≤ E[Φ( xt τt ) + 2µ∥xt τt − x∗∥2] − E[Φ( xt

work page

[56] [56]

Summing this up and apply the random selection rule of ¯ x gives E[Φ( ¯x) − Φ∗] ≤ 1 2µητT E[Φ( x1 0 ) − Φ∗ + 2µ∥x1 0 − x∗∥2] + 9σ2 0 4µBt ≤ 5 2µητT E[Φ( x1 0 ) − Φ∗] + 9σ2 0 4µBt

+ 2µ∥xt 0 − x∗∥2] + τt 9σ2 0 η 2Bt . Summing this up and apply the random selection rule of ¯ x gives E[Φ( ¯x) − Φ∗] ≤ 1 2µητT E[Φ( x1 0 ) − Φ∗ + 2µ∥x1 0 − x∗∥2] + 9σ2 0 4µBt ≤ 5 2µητT E[Φ( x1 0 ) − Φ∗] + 9σ2 0 4µBt . If we choose T = ⌈5√ ǫ µη ⌉, τ= S = 1√ ǫ and Bt = 1 + 9σ 2 0 2µǫ , then 5 2µητ T ≤ 1 2 and we obtain E[Φ( ¯x) − Φ∗] ≤ 1 2 E[Φ( x1

work page

[57] [57]

This proves the inequality (31)

− Φ∗] + 1 2 ǫ . This proves the inequality (31). The rest of the proof will mi mic that of Theorem 5. Discussions on sample complexity: • If we choose τ= S = 1/ √ ǫ, Bt = 1 + 9σ 2 0 2µǫ , and T = ⌈5√ ǫ µη ⌉, then the sample complexity is T ( B + 2τS) ln 1 ǫ= 5√ ǫ µη ( 9σ2 0 2µǫ+ 1√ ǫ 1√ ǫ ) ln 1 ǫ= O ( ( µ− 2σ2 0 ǫ− 1/ 2 + µ− 1ǫ− 1/ 2) ln ǫ− 1 ) . The abo...

work page

[58] [58]

D.2 Proof of Theorem 8 The proof is very similar to the previous one

− F ∗) + 1 2 ǫ, and the total sample complexity is T ( B + 2τS) ln 1 ǫ= 5 µη ( 9σ2 0 µǫ+ 2 ) ln 1 ǫ= O ( µ− 2σ2 0 ǫ− 1 + µ− 1 ) ln ǫ− 1 Deﬁning the condition number κ= LF ν= O( 1/( µη)) , the above complexity becomes T ( B + 2τS) ln 1 ǫ= O ( κ2σ2 0 ǫ− 1 + κ ) ln ǫ− 1 Thus when σ= 0, we have O (κln ǫ− 1) for deterministic optimization. D.2 Proof of Theorem...

work page 2000