Accelerating Mini-batch SARAH by Step Size Rules

Cheng Wang; Zengping Chen; Zhuang Yang

arxiv: 1906.08496 · v1 · pith:MPXX44LMnew · submitted 2019-06-20 · 💻 cs.LG · math.OC· stat.ML

Accelerating Mini-batch SARAH by Step Size Rules

Zhuang Yang , Zengping Chen , Cheng Wang This is my paper

Pith reviewed 2026-05-25 20:01 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords SARAHmini-batch optimizationBarzilai-Borwein step sizestochastic gradientlinear convergencestrongly convexvariance reductionstep size selection

0 comments

The pith

Mini-batch SARAH with a random Barzilai-Borwein step size rule converges linearly in expectation for strongly convex functions and has better complexity than the original method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MB-SARAH-RBB, a mini-batch version of SARAH that replaces manual step-size selection with the Random Barzilai-Borwein rule for computing step sizes on the fly. It proves that this change preserves linear convergence in expectation when the objective is strongly convex. The complexity analysis shows the new method requires fewer gradient evaluations than standard mini-batch SARAH to reach a given accuracy. Experiments on common datasets confirm that MB-SARAH-RBB matches or exceeds the performance of current algorithms without any hand-tuned parameters.

Core claim

By replacing fixed or manually chosen step sizes with the Random Barzilai-Borwein (RBB) rule inside the mini-batch SARAH framework, the resulting MB-SARAH-RBB algorithm converges linearly in expectation for strongly convex objectives, achieves strictly better iteration complexity than the baseline MB-SARAH, and performs competitively on standard data sets.

What carries the argument

The Random Barzilai-Borwein (RBB) step-size rule, which generates per-iteration step lengths from two recent gradient estimates so that the linear-convergence conditions of mini-batch SARAH remain satisfied.

If this is right

Linear convergence in expectation holds for strongly convex objectives.
Complexity is strictly better than the original mini-batch SARAH.
The algorithm matches or beats state-of-the-art methods on standard data sets without step-size tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adaptive rule could be tested inside other variance-reduced recursions such as SVRG or SPIDER to remove their own step-size tuning burden.
Because the original SARAH framework already applies to non-convex problems, the RBB variant might be tried there even though the linear-rate proof does not cover that case.
The reduction in manual tuning lowers the practical barrier to deploying SARAH-style methods on new data sets where cross-validation is expensive.
The random choice inside the BB formula may add a mild smoothing effect that helps stability on ill-conditioned problems, an effect worth measuring separately.

Load-bearing premise

The random Barzilai-Borwein step-size rule produces values that satisfy the conditions needed for the linear convergence analysis of mini-batch SARAH to remain valid.

What would settle it

Run MB-SARAH-RBB on a simple strongly convex quadratic and observe whether the measured convergence rate is linear with the constant predicted by the theorem, or whether the step sizes occasionally violate the required bounds and cause the error to stop decreasing linearly.

Figures

Figures reproduced from arXiv: 1906.08496 by Cheng Wang, Zengping Chen, Zhuang Yang.

**Figure 2.** Figure 2: Comparison of MB-SARAH-RBB and MB-SARAH with fixed st [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of MB-SARAH-RBB and MB-SARAH with fixed st [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Different initial step sizes for MB-SARAH-RBB on [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of MB-SARAH-RBB and mS2GD-RBB on [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of MB-SARAH-RBB and mS2GD-RBB on [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of different methods on three data sets: [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

StochAstic Recursive grAdient algoritHm (SARAH), originally proposed for convex optimization and also proven to be effective for general nonconvex optimization, has received great attention due to its simple recursive framework for updating stochastic gradient estimates. The performance of SARAH significantly depends on the choice of step size sequence. However, SARAH and its variants often employ a best-tuned step size by mentor, which is time consuming in practice. Motivated by this gap, we proposed a variant of the Barzilai-Borwein (BB) method, referred to as the Random Barzilai-Borwein (RBB) method, to calculate step size for SARAH in the mini-batch setting, thereby leading to a new SARAH method: MB-SARAH-RBB. We prove that MB-SARAH-RBB converges linearly in expectation for strongly convex objective functions. We analyze the complexity of MB-SARAH-RBB and show that it is better than the original method. Numerical experiments on standard data sets indicate that MB-SARAH-RBB outperforms or matches state-of-the-art algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MB-SARAH-RBB adds a random Barzilai-Borwein step-size rule to mini-batch SARAH but the linear convergence claim rests on an unshown compatibility between the random steps and the original analysis bounds.

read the letter

The paper's main move is to replace tuned step sizes in mini-batch SARAH with a random Barzilai-Borwein rule, producing the variant they call MB-SARAH-RBB. This targets the practical problem of having to pick a good fixed step size by hand. The numerical experiments on standard data sets show the new version matches or beats several existing methods, which is the part that actually lands as useful engineering work. The combination itself is new as a specific recipe even if the underlying BB and SARAH pieces are older. The theory claims linear convergence in expectation for strongly convex functions plus better complexity than the base method. That is the part that needs scrutiny. The original SARAH analysis typically requires the step size to stay inside an interval set by the strong-convexity and smoothness constants. Because the RBB rule generates a fresh random value each iteration from the data, the proof only goes through if those random values remain admissible or if the analysis is extended to handle the randomness without extra error terms. The abstract gives no derivation or bound check on this point, so the complexity improvement does not yet follow from the stated argument. If the full paper simply invokes the old theorem without addressing the random-step issue, the guarantee is not established. This is the sort of paper that matters to people already working on variance-reduced stochastic methods who want fewer tuning knobs. A reader in that niche can extract the experimental comparison and the concrete algorithm description. It is not broad enough or deep enough to change how most people think about the area. I would send it to peer review so the authors can supply the missing step-size argument and any supporting lemmas; the core idea is coherent enough to be worth referee time even if the current write-up is incomplete on the proof.

Referee Report

2 major / 2 minor

Summary. The paper proposes MB-SARAH-RBB, a mini-batch variant of SARAH that replaces fixed or tuned step sizes with a Random Barzilai-Borwein (RBB) rule for automatic step-size selection. It claims to prove linear convergence in expectation for strongly convex objectives, to derive a complexity bound that improves on the original MB-SARAH, and to show competitive or superior empirical performance on standard datasets.

Significance. If the convergence claim holds, the work supplies a practical, tuning-free step-size mechanism that preserves the linear rate of mini-batch SARAH while improving practical usability; the complexity improvement and numerical results would strengthen the case for data-dependent step sizes in variance-reduced methods.

major comments (2)

[§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the linear-convergence proof invokes the step-size restriction required by the original MB-SARAH analysis (typically 0 < η ≤ 2μ/L or an analogous explicit interval). The manuscript does not supply a separate argument showing that every realization of the random RBB step size η_k lies inside this interval with probability 1, nor does it extend the analysis to accommodate a random η_k while controlling the additional error terms that arise. This compatibility is load-bearing for the main theorem.
[§5] §5 (Complexity Analysis): the claimed complexity improvement is stated relative to the original method, yet the derivation does not explicitly recompute the iteration complexity under the random step-size distribution; it is therefore unclear whether the improvement is strict or holds only in expectation under additional assumptions on the RBB rule.

minor comments (2)

[§3] Notation for the RBB step-size formula is introduced without an explicit equation number; cross-referencing would improve readability.
[§6] The numerical experiments section would benefit from reporting the observed range of realized RBB step sizes to allow readers to check consistency with the theoretical interval.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the linear-convergence proof invokes the step-size restriction required by the original MB-SARAH analysis (typically 0 < η ≤ 2μ/L or an analogous explicit interval). The manuscript does not supply a separate argument showing that every realization of the random RBB step size η_k lies inside this interval with probability 1, nor does it extend the analysis to accommodate a random η_k while controlling the additional error terms that arise. This compatibility is load-bearing for the main theorem.

Authors: We agree that the proof in the current manuscript relies on the step-size interval from the original MB-SARAH analysis without explicitly verifying that RBB realizations remain inside it almost surely. In the revised version we will insert a new lemma establishing that the random Barzilai-Borwein step sizes satisfy the required bound with probability 1 (by exploiting the positive-definiteness of the mini-batch Hessian approximations and the boundedness of the gradient differences). If this deterministic guarantee proves too restrictive, we will instead extend the convergence argument to random step sizes by controlling the additional variance terms that appear when taking expectation over η_k. revision: yes
Referee: [§5] §5 (Complexity Analysis): the claimed complexity improvement is stated relative to the original method, yet the derivation does not explicitly recompute the iteration complexity under the random step-size distribution; it is therefore unclear whether the improvement is strict or holds only in expectation under additional assumptions on the RBB rule.

Authors: The complexity claim in the manuscript is obtained by taking the expectation of the contraction factor with respect to the distribution of the RBB step sizes. To address the concern, the revised version will explicitly recompute the iteration complexity bound under this distribution, showing that the improvement holds in expectation without requiring further assumptions beyond those already stated for the RBB rule. revision: yes

Circularity Check

0 steps flagged

No circularity: convergence proof for MB-SARAH-RBB is presented as independent analysis

full rationale

The paper proposes the RBB step-size rule as a new variant and separately claims to prove linear convergence in expectation plus improved complexity for strongly convex objectives. No equations, fitted parameters, or self-citations are visible that would reduce the claimed proof to a definition, a renamed input, or an unverified self-citation chain. The derivation chain therefore remains self-contained against external benchmarks such as the original SARAH analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; the ledger is therefore limited to the single domain assumption stated in the abstract.

axioms (1)

domain assumption The objective function is strongly convex.
Explicitly required for the linear convergence claim in the abstract.

pith-pipeline@v0.9.0 · 5723 in / 1134 out tokens · 42591 ms · 2026-05-25T20:01:07.047489+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 4 internal anchors

[1]

Mairal, F

J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionar y learning for sparse coding, in: Proceedings of the 26th International Co nference on Machine Learning, ACM, 2009, pp. 689–696

work page 2009
[2]

T. Zhang, Solving large scale linear prediction problem s using stochastic gradient descent algorithms, in: Proceedings of the 21st In ternational Conference on Machine Learning, ACM, 2004, p. 116

work page 2004
[3]

P . F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ram anan, Object detection with discriminatively trained part-based model s, IEEE transac- tions on pattern analysis and machine intelligence 32 (9) (2 009) 1627– 1645

work page
[4]

Blanz, T

V . Blanz, T. V etter, Face recognition based on ﬁtting a 3d morphable model, IEEE Transactions on Pattern Analysis and Machine In telligence 25 (9) (2003) 1063–1074

work page 2003
[5]

Hoffman, F

M. Hoffman, F. R. Bach, D. M. Blei, Online learning for lat ent dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 856–864

work page 2010
[6]

Bottou, Large-scale machine learning with stochasti c gradient de- scent, in: Proceedings of COMPSTA T’2010, Springer, 2010, p p

L. Bottou, Large-scale machine learning with stochasti c gradient de- scent, in: Proceedings of COMPSTA T’2010, Springer, 2010, p p. 177– 186

work page 2010
[7]

Y . Lin, F. Lv, S. Zhu, M. Y ang, T. Cour, K. Y u, L. Cao, T. Huang, Large- scale image classiﬁcation: fast feature extraction and svm training, in: CVPR 2011, IEEE, 2011, pp. 1689–1696

work page 2011
[8]

Zhang, P

Z. Zhang, P . Luo, C. C. Loy, X. Tang, Facial landmark detec tion by deep multi-task learning, in: European Conference on Compu ter Vision, Springer, 2014, pp. 94–108

work page 2014
[9]

Tao, Q.-K

Q. Tao, Q.-K. Gao, D.-J. Chu, G.-W. Wu, Stochastic learni ng via optimizing the variational inequalities, IEEE Transactio ns on Neural Networks and Learning Systems 25 (10) (2014) 1769–1778

work page 2014
[10]

M. F. Bugallo, V . Elvira, L. Martino, D. Luengo, J. Migue z, P . M. Djuric, Adaptive importance sampling: the past, the present, and th e future, IEEE Signal Processing Magazine 34 (4) (2017) 60–79

work page 2017
[11]

C. Du, J. Zhu, B. Zhang, Learning deep generative models with doubly stochastic gradient mcmc, IEEE Transactions on Neural Netw orks and Learning Systems 29 (7) (2017) 3084–3096

work page 2017
[12]

Li, Preconditioned stochastic gradient descent , IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1 466

X.-L. Li, Preconditioned stochastic gradient descent , IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1 466

work page 2018
[13]

Moulines, F

E. Moulines, F. R. Bach, Non-asymptotic analysis of sto chastic ap- proximation algorithms for machine learning, in: Advances in Neural Information Processing Systems, 2011, pp. 451–459

work page 2011
[14]

Nemirovski, A

A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust s tochastic approximation approach to stochastic programming, SIAM Jo urnal on optimization 19 (4) (2009) 1574–1609

work page 2009
[15]

Bottou, F

L. Bottou, F. E. Curtis, J. Nocedal, Optimization metho ds for large-scale machine learning, SIAM Review 60 (2) (2018) 223–311

work page 2018
[16]

Cotter, O

A. Cotter, O. Shamir, N. Srebro, K. Sridharan, Better mi ni-batch algorithms via accelerated gradient methods, in: Advances in Neural Information Processing Systems, 2011, pp. 1647–1655

work page 2011
[17]

Koneˇ cn` y, J

J. Koneˇ cn` y, J. Liu, P . Richt´ arik, M. Tak´ aˇ c, Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE Journal of Se lected Topics in Signal Processing 10 (2) (2016) 242–255

work page 2016
[18]

Needell, R

D. Needell, R. Ward, N. Srebro, Stochastic gradient des cent, weighted sampling, and the randomized kaczmarz algorithm, in: Advan ces in Neural Information Processing Systems, 2014, pp. 1017–102 5

work page 2014
[19]

Csiba, P

D. Csiba, P . Richt´ arik, Importance sampling for minibatches, The Journal of Machine Learning Research 19 (1) (2018) 962–982

work page 2018
[20]

T. Fu, Z. Zhang, Cpsg-mcmc: Clustering-based preproce ssing method for stochastic gradient mcmc, in: Artiﬁcial Intelligence a nd Statistics, 2017, pp. 841–850

work page 2017
[21]

N. L. Roux, M. Schmidt, F. R. Bach, A stochastic gradient method with an exponential convergence rate for ﬁnite training sets, in : Advances in Neural Information Processing Systems, 2012, pp. 2663–267 1

work page 2012
[22]

Defazio, F

A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: A fast inc remental gradient method with support for non-strongly convex compo site ob- jectives, in: Advances in Neural Information Processing Sy stems, 2014, pp. 1646–1654

work page 2014
[23]

Shalev-Shwartz, T

S. Shalev-Shwartz, T. Zhang, Stochastic dual coordina te ascent methods for regularized loss minimization, Journal of Machine Lear ning Research 14 (Feb) (2013) 567–599

work page 2013
[24]

Johnson, T

R. Johnson, T. Zhang, Accelerating stochastic gradien t descent using predictive variance reduction, in: Advances in Neural Info rmation Pro- cessing Systems, 2013, pp. 315–323

work page 2013
[25]

Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp

A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp. 1574–1582

work page 2014
[26]

L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Sarah: A novel method for machine learning problems using stochastic recursive g radient, in: International Conference on Machine Learning-V olume 70, J MLR. org, 2017, pp. 2613–2621

work page 2017
[27]

C. Fang, C. J. Li, Z. Lin, T. Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differentia l estimator, in: Advances in Neural Information Processing Systems, 2018, p p. 689– 699

work page 2018
[28]

L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Finite-sum smooth optimization with sara h, def 1 (2019) 1

work page 2019
[29]

L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Stochas tic recur- sive gradient algorithm for nonconvex optimization, arXiv preprint arXiv:1705.07261

work page internal anchor Pith review Pith/arXiv arXiv
[30]

L. M. Nguyen, K. Scheinberg, M. Tak´ aˇ c, Inexact sarah a lgorithm for stochastic optimization, arXiv preprint arXiv:1811.1010 5

work page arXiv
[31]

Nonconvex Variance Reduced Optimization with Arbitrary Sampling

S. Horv´ ath, P . Richt´ arik, Nonconvex variance reduced optimization with arbitrary sampling, arXiv preprint arXiv:1809.04146

work page internal anchor Pith review Pith/arXiv arXiv
[32]

D. Zhou, Q. Gu, Stochastic recursive variance-reduced cubic regulariza- tion methods, arXiv preprint arXiv:1901.11518

work page arXiv 1901
[33]

N. H. Pham, L. M. Nguyen, D. T. Phan, Q. Tran-Dinh, ProxSA RAH: An efﬁcient algorithmic framework for stochastic composit e nonconvex optimization, arXiv preprint arXiv:1902.05679

work page internal anchor Pith review Pith/arXiv arXiv 1902
[34]

L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Optimal ﬁnite-sum smooth non-convex opti mization with sarah, arXiv preprint arXiv:1901.07648

work page internal anchor Pith review Pith/arXiv arXiv 1901
[35]

L. Xiao, T. Zhang, A proximal stochastic gradient metho d with progres- sive variance reduction, SIAM Journal on Optimization 24 (4 ) (2014) 2057–2075

work page 2014
[36]

Barzilai, J

J. Barzilai, J. M. Borwein, Two-point step size gradien t methods, IMA Journal of Numerical Analysis 8 (1) (1988) 141–148. 10

work page 1988
[37]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Random Barzilai-Borwe in step size for mini-batch algorithms, Engineering Applications of Artiﬁcial Intelligence 72 (2018) 124 – 135

work page 2018
[38]

Kesten, Accelerated stochastic approximation, Ann als of Mathemat- ical Statistics 29 (1) (1958) 41–59

H. Kesten, Accelerated stochastic approximation, Ann als of Mathemat- ical Statistics 29 (1) (1958) 41–59

work page 1958
[39]

Robbins, S

H. Robbins, S. Monro, A stochastic approximation metho d, Annals of Mathematical Statistics (1951) 400–407

work page 1951
[40]

Benveniste, M

A. Benveniste, M. Mtivier, P . Priouret, Adaptive Algor ithms and Stochastic Approximations, Springer Berlin Heidelberg, 1 990

work page
[41]

Tieleman, G

T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide th e gradient by a running average of its recent magnitude., COURSERA: Neural Networks for Machine Learning

work page
[42]

A. P . George, W. B. Powell, Adaptive stepsizes for recur sive estima- tion with applications in approximate dynamic programming , Machine Learning 65 (1) (2006) 167–198

work page 2006
[43]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Mini-batch algorithms with online step size, Knowledge-Based Systems 165 (2019) 228–240

work page 2019
[44]

Sopyła, P

K. Sopyła, P . Drozda, Stochastic gradient descent with barzilai–borwein update step for svm, Information Sciences 316 (2015) 218–23 3

work page 2015
[45]

C. Tan, S. Ma, Y . H. Dai, Y . Qian, Barzilai-Borwein step size for stochas- tic gradient descent, in: Advances in Neural Information Pr ocessing Systems, 2016, pp. 685–693

work page 2016
[46]

Y ang, C

Z. Y ang, C. Wang, Y . Zang, J. Li, Mini-batch algorithms w ith Barzilai– Borwein update step, Neurocomputing 314 (2018) 177–185

work page 2018
[47]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Accelerated stochasti c gradient descent with step size selection rules, Signal Processing 1 59 (2019) 171–186

work page 2019
[48]

S. De, A. Y adav, D. Jacobs, T. Goldstein, Automated infe rence with adaptive batches, in: International Conference on Artiﬁci al Intelligence and Statistics, 2017

work page 2017
[49]

K. Ma, J. Zeng, J. Xiong, Q. Xu, X. Cao, W. Liu, Y . Y ao, Stoc hastic Non-convex Ordinal Embedding with Stabilized Barzilai-Bo rwein step size, in: AAAI Conference on Artiﬁcial Intelligence, 2018

work page 2018
[50]

Y ouseﬁan, A

F. Y ouseﬁan, A. Nedi´ c, U. V . Shanbhag, On stochastic gr adient and subgradient methods with adaptive steplength sequences, A utomatica 48 (1) (2012) 56–67

work page 2012
[51]

Mahsereci, P

M. Mahsereci, P . Hennig, Probabilistic line searches f or stochastic optimization, Journal of Machine Learning Research 18 (119 ) (2017) 1–59

work page 2017
[52]

A. G. Baydin, R. Cornish, D. M. Rubio, M. W. Schmidt, F. D. Wood, Online learning rate adaptation with hypergradient d escent, in: International Conference on Learning Representations, 20 18

work page
[53]

S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stocha stic variance reduction for nonconvex optimization, in: International C onference on Machine Learning, 2016, pp. 314–323

work page 2016
[54]

L. B. Almeida, T. Langlois, J. D. Amaral, A. Plakhov, Par ameter adaptation in stochastic optimization, in: On-line learni ng in neural networks, Cambridge University Press, 1999, pp. 111–134

work page 1999
[55]

Schaul, S

T. Schaul, S. Zhang, Y . Lecun, No more pesky learning rat es, in: International Conference on Machine Learning, 2013, pp. 34 3–351

work page 2013
[56]

Tieleman, G

T. Tieleman, G. Hinton, Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machin e learning, Technical Report

work page
[57]

Z. Wang, K. Crammer, S. Vucetic, Breaking the curse of ke rnelization: Budgeted stochastic gradient descent for large-scale svm t raining, Jour- nal of Machine Learning Research 13 (Oct) (2012) 3103–3131

work page 2012
[58]

Bhojanapalli, B

S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optim ality of local search for low rank matrix recovery, in: Advances in Neural I nformation Processing Systems, 2016, pp. 3873–3881

work page 2016
[59]

Sutskever, J

I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the impo rtance of ini- tialization and momentum in deep learning, in: Internation al Conference on Machine Learning, 2013, pp. 1139–1147

work page 2013
[60]

Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004

Y . Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004

work page 2004
[61]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Random Barzilai–Borwe in step size for mini-batch algorithms, Engineering Applications of Artiﬁcial Intelligence 72 (2018) 124–135

work page 2018
[62]

Bordes, L

A. Bordes, L. Bottou, P . Gallinari, SGD-QN: Careful qua si-newton stochastic gradient descent, Journal of Machine Learning R esearch 10 (Jul) (2009) 1737–1754

work page 2009
[63]

R. H. Byrd, G. M. Chin, J. Nocedal, Y . Wu, Sample size sele ction in optimization methods for machine learning, Mathematical p rogramming 134 (1) (2012) 127–155

work page 2012
[64]

R. H. Byrd, S. L. Hansen, J. Nocedal, Y . Singer, A stochas tic quasi- newton method for large-scale optimization, SIAM Journal o n Opti- mization 26 (2) (2016) 1008–1031

work page 2016
[65]

Agarwal, B

N. Agarwal, B. Bullins, E. Hazan, Second-order stochas tic optimization for machine learning in linear time, The Journal of Machine L earning Research 18 (1) (2017) 4148–4187

work page 2017
[66]

Tripuraneni, M

N. Tripuraneni, M. Stern, C. Jin, J. Regier, M. I. Jordan , Stochastic cubic regularization for fast nonconvex optimization, in: Advan ces in Neural Information Processing Systems, 2018, pp. 2899–2908

work page 2018
[67]

Schmidt, R

M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Cl ifton, A. Sarkar, Non-uniform stochastic average gradient method for training conditional random ﬁelds., in: AISTA TS, 2015

work page 2015

[1] [1]

Mairal, F

J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionar y learning for sparse coding, in: Proceedings of the 26th International Co nference on Machine Learning, ACM, 2009, pp. 689–696

work page 2009

[2] [2]

T. Zhang, Solving large scale linear prediction problem s using stochastic gradient descent algorithms, in: Proceedings of the 21st In ternational Conference on Machine Learning, ACM, 2004, p. 116

work page 2004

[3] [3]

P . F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ram anan, Object detection with discriminatively trained part-based model s, IEEE transac- tions on pattern analysis and machine intelligence 32 (9) (2 009) 1627– 1645

work page

[4] [4]

Blanz, T

V . Blanz, T. V etter, Face recognition based on ﬁtting a 3d morphable model, IEEE Transactions on Pattern Analysis and Machine In telligence 25 (9) (2003) 1063–1074

work page 2003

[5] [5]

Hoffman, F

M. Hoffman, F. R. Bach, D. M. Blei, Online learning for lat ent dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 856–864

work page 2010

[6] [6]

Bottou, Large-scale machine learning with stochasti c gradient de- scent, in: Proceedings of COMPSTA T’2010, Springer, 2010, p p

L. Bottou, Large-scale machine learning with stochasti c gradient de- scent, in: Proceedings of COMPSTA T’2010, Springer, 2010, p p. 177– 186

work page 2010

[7] [7]

Y . Lin, F. Lv, S. Zhu, M. Y ang, T. Cour, K. Y u, L. Cao, T. Huang, Large- scale image classiﬁcation: fast feature extraction and svm training, in: CVPR 2011, IEEE, 2011, pp. 1689–1696

work page 2011

[8] [8]

Zhang, P

Z. Zhang, P . Luo, C. C. Loy, X. Tang, Facial landmark detec tion by deep multi-task learning, in: European Conference on Compu ter Vision, Springer, 2014, pp. 94–108

work page 2014

[9] [9]

Tao, Q.-K

Q. Tao, Q.-K. Gao, D.-J. Chu, G.-W. Wu, Stochastic learni ng via optimizing the variational inequalities, IEEE Transactio ns on Neural Networks and Learning Systems 25 (10) (2014) 1769–1778

work page 2014

[10] [10]

M. F. Bugallo, V . Elvira, L. Martino, D. Luengo, J. Migue z, P . M. Djuric, Adaptive importance sampling: the past, the present, and th e future, IEEE Signal Processing Magazine 34 (4) (2017) 60–79

work page 2017

[11] [11]

C. Du, J. Zhu, B. Zhang, Learning deep generative models with doubly stochastic gradient mcmc, IEEE Transactions on Neural Netw orks and Learning Systems 29 (7) (2017) 3084–3096

work page 2017

[12] [12]

Li, Preconditioned stochastic gradient descent , IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1 466

X.-L. Li, Preconditioned stochastic gradient descent , IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1 466

work page 2018

[13] [13]

Moulines, F

E. Moulines, F. R. Bach, Non-asymptotic analysis of sto chastic ap- proximation algorithms for machine learning, in: Advances in Neural Information Processing Systems, 2011, pp. 451–459

work page 2011

[14] [14]

Nemirovski, A

A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust s tochastic approximation approach to stochastic programming, SIAM Jo urnal on optimization 19 (4) (2009) 1574–1609

work page 2009

[15] [15]

Bottou, F

L. Bottou, F. E. Curtis, J. Nocedal, Optimization metho ds for large-scale machine learning, SIAM Review 60 (2) (2018) 223–311

work page 2018

[16] [16]

Cotter, O

A. Cotter, O. Shamir, N. Srebro, K. Sridharan, Better mi ni-batch algorithms via accelerated gradient methods, in: Advances in Neural Information Processing Systems, 2011, pp. 1647–1655

work page 2011

[17] [17]

Koneˇ cn` y, J

J. Koneˇ cn` y, J. Liu, P . Richt´ arik, M. Tak´ aˇ c, Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE Journal of Se lected Topics in Signal Processing 10 (2) (2016) 242–255

work page 2016

[18] [18]

Needell, R

D. Needell, R. Ward, N. Srebro, Stochastic gradient des cent, weighted sampling, and the randomized kaczmarz algorithm, in: Advan ces in Neural Information Processing Systems, 2014, pp. 1017–102 5

work page 2014

[19] [19]

Csiba, P

D. Csiba, P . Richt´ arik, Importance sampling for minibatches, The Journal of Machine Learning Research 19 (1) (2018) 962–982

work page 2018

[20] [20]

T. Fu, Z. Zhang, Cpsg-mcmc: Clustering-based preproce ssing method for stochastic gradient mcmc, in: Artiﬁcial Intelligence a nd Statistics, 2017, pp. 841–850

work page 2017

[21] [21]

N. L. Roux, M. Schmidt, F. R. Bach, A stochastic gradient method with an exponential convergence rate for ﬁnite training sets, in : Advances in Neural Information Processing Systems, 2012, pp. 2663–267 1

work page 2012

[22] [22]

Defazio, F

A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: A fast inc remental gradient method with support for non-strongly convex compo site ob- jectives, in: Advances in Neural Information Processing Sy stems, 2014, pp. 1646–1654

work page 2014

[23] [23]

Shalev-Shwartz, T

S. Shalev-Shwartz, T. Zhang, Stochastic dual coordina te ascent methods for regularized loss minimization, Journal of Machine Lear ning Research 14 (Feb) (2013) 567–599

work page 2013

[24] [24]

Johnson, T

R. Johnson, T. Zhang, Accelerating stochastic gradien t descent using predictive variance reduction, in: Advances in Neural Info rmation Pro- cessing Systems, 2013, pp. 315–323

work page 2013

[25] [25]

Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp

A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp. 1574–1582

work page 2014

[26] [26]

L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Sarah: A novel method for machine learning problems using stochastic recursive g radient, in: International Conference on Machine Learning-V olume 70, J MLR. org, 2017, pp. 2613–2621

work page 2017

[27] [27]

C. Fang, C. J. Li, Z. Lin, T. Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differentia l estimator, in: Advances in Neural Information Processing Systems, 2018, p p. 689– 699

work page 2018

[28] [28]

L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Finite-sum smooth optimization with sara h, def 1 (2019) 1

work page 2019

[29] [29]

L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Stochas tic recur- sive gradient algorithm for nonconvex optimization, arXiv preprint arXiv:1705.07261

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

L. M. Nguyen, K. Scheinberg, M. Tak´ aˇ c, Inexact sarah a lgorithm for stochastic optimization, arXiv preprint arXiv:1811.1010 5

work page arXiv

[31] [31]

Nonconvex Variance Reduced Optimization with Arbitrary Sampling

S. Horv´ ath, P . Richt´ arik, Nonconvex variance reduced optimization with arbitrary sampling, arXiv preprint arXiv:1809.04146

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

D. Zhou, Q. Gu, Stochastic recursive variance-reduced cubic regulariza- tion methods, arXiv preprint arXiv:1901.11518

work page arXiv 1901

[33] [33]

N. H. Pham, L. M. Nguyen, D. T. Phan, Q. Tran-Dinh, ProxSA RAH: An efﬁcient algorithmic framework for stochastic composit e nonconvex optimization, arXiv preprint arXiv:1902.05679

work page internal anchor Pith review Pith/arXiv arXiv 1902

[34] [34]

L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Optimal ﬁnite-sum smooth non-convex opti mization with sarah, arXiv preprint arXiv:1901.07648

work page internal anchor Pith review Pith/arXiv arXiv 1901

[35] [35]

L. Xiao, T. Zhang, A proximal stochastic gradient metho d with progres- sive variance reduction, SIAM Journal on Optimization 24 (4 ) (2014) 2057–2075

work page 2014

[36] [36]

Barzilai, J

J. Barzilai, J. M. Borwein, Two-point step size gradien t methods, IMA Journal of Numerical Analysis 8 (1) (1988) 141–148. 10

work page 1988

[37] [37]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Random Barzilai-Borwe in step size for mini-batch algorithms, Engineering Applications of Artiﬁcial Intelligence 72 (2018) 124 – 135

work page 2018

[38] [38]

Kesten, Accelerated stochastic approximation, Ann als of Mathemat- ical Statistics 29 (1) (1958) 41–59

H. Kesten, Accelerated stochastic approximation, Ann als of Mathemat- ical Statistics 29 (1) (1958) 41–59

work page 1958

[39] [39]

Robbins, S

H. Robbins, S. Monro, A stochastic approximation metho d, Annals of Mathematical Statistics (1951) 400–407

work page 1951

[40] [40]

Benveniste, M

A. Benveniste, M. Mtivier, P . Priouret, Adaptive Algor ithms and Stochastic Approximations, Springer Berlin Heidelberg, 1 990

work page

[41] [41]

Tieleman, G

T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide th e gradient by a running average of its recent magnitude., COURSERA: Neural Networks for Machine Learning

work page

[42] [42]

A. P . George, W. B. Powell, Adaptive stepsizes for recur sive estima- tion with applications in approximate dynamic programming , Machine Learning 65 (1) (2006) 167–198

work page 2006

[43] [43]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Mini-batch algorithms with online step size, Knowledge-Based Systems 165 (2019) 228–240

work page 2019

[44] [44]

Sopyła, P

K. Sopyła, P . Drozda, Stochastic gradient descent with barzilai–borwein update step for svm, Information Sciences 316 (2015) 218–23 3

work page 2015

[45] [45]

C. Tan, S. Ma, Y . H. Dai, Y . Qian, Barzilai-Borwein step size for stochas- tic gradient descent, in: Advances in Neural Information Pr ocessing Systems, 2016, pp. 685–693

work page 2016

[46] [46]

Y ang, C

Z. Y ang, C. Wang, Y . Zang, J. Li, Mini-batch algorithms w ith Barzilai– Borwein update step, Neurocomputing 314 (2018) 177–185

work page 2018

[47] [47]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Accelerated stochasti c gradient descent with step size selection rules, Signal Processing 1 59 (2019) 171–186

work page 2019

[48] [48]

S. De, A. Y adav, D. Jacobs, T. Goldstein, Automated infe rence with adaptive batches, in: International Conference on Artiﬁci al Intelligence and Statistics, 2017

work page 2017

[49] [49]

K. Ma, J. Zeng, J. Xiong, Q. Xu, X. Cao, W. Liu, Y . Y ao, Stoc hastic Non-convex Ordinal Embedding with Stabilized Barzilai-Bo rwein step size, in: AAAI Conference on Artiﬁcial Intelligence, 2018

work page 2018

[50] [50]

Y ouseﬁan, A

F. Y ouseﬁan, A. Nedi´ c, U. V . Shanbhag, On stochastic gr adient and subgradient methods with adaptive steplength sequences, A utomatica 48 (1) (2012) 56–67

work page 2012

[51] [51]

Mahsereci, P

M. Mahsereci, P . Hennig, Probabilistic line searches f or stochastic optimization, Journal of Machine Learning Research 18 (119 ) (2017) 1–59

work page 2017

[52] [52]

A. G. Baydin, R. Cornish, D. M. Rubio, M. W. Schmidt, F. D. Wood, Online learning rate adaptation with hypergradient d escent, in: International Conference on Learning Representations, 20 18

work page

[53] [53]

S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stocha stic variance reduction for nonconvex optimization, in: International C onference on Machine Learning, 2016, pp. 314–323

work page 2016

[54] [54]

L. B. Almeida, T. Langlois, J. D. Amaral, A. Plakhov, Par ameter adaptation in stochastic optimization, in: On-line learni ng in neural networks, Cambridge University Press, 1999, pp. 111–134

work page 1999

[55] [55]

Schaul, S

T. Schaul, S. Zhang, Y . Lecun, No more pesky learning rat es, in: International Conference on Machine Learning, 2013, pp. 34 3–351

work page 2013

[56] [56]

Tieleman, G

T. Tieleman, G. Hinton, Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machin e learning, Technical Report

work page

[57] [57]

Z. Wang, K. Crammer, S. Vucetic, Breaking the curse of ke rnelization: Budgeted stochastic gradient descent for large-scale svm t raining, Jour- nal of Machine Learning Research 13 (Oct) (2012) 3103–3131

work page 2012

[58] [58]

Bhojanapalli, B

S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optim ality of local search for low rank matrix recovery, in: Advances in Neural I nformation Processing Systems, 2016, pp. 3873–3881

work page 2016

[59] [59]

Sutskever, J

I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the impo rtance of ini- tialization and momentum in deep learning, in: Internation al Conference on Machine Learning, 2013, pp. 1139–1147

work page 2013

[60] [60]

Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004

Y . Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004

work page 2004

[61] [61]

Y ang, C

Z. Y ang, C. Wang, Z. Zhang, J. Li, Random Barzilai–Borwe in step size for mini-batch algorithms, Engineering Applications of Artiﬁcial Intelligence 72 (2018) 124–135

work page 2018

[62] [62]

Bordes, L

A. Bordes, L. Bottou, P . Gallinari, SGD-QN: Careful qua si-newton stochastic gradient descent, Journal of Machine Learning R esearch 10 (Jul) (2009) 1737–1754

work page 2009

[63] [63]

R. H. Byrd, G. M. Chin, J. Nocedal, Y . Wu, Sample size sele ction in optimization methods for machine learning, Mathematical p rogramming 134 (1) (2012) 127–155

work page 2012

[64] [64]

R. H. Byrd, S. L. Hansen, J. Nocedal, Y . Singer, A stochas tic quasi- newton method for large-scale optimization, SIAM Journal o n Opti- mization 26 (2) (2016) 1008–1031

work page 2016

[65] [65]

Agarwal, B

N. Agarwal, B. Bullins, E. Hazan, Second-order stochas tic optimization for machine learning in linear time, The Journal of Machine L earning Research 18 (1) (2017) 4148–4187

work page 2017

[66] [66]

Tripuraneni, M

N. Tripuraneni, M. Stern, C. Jin, J. Regier, M. I. Jordan , Stochastic cubic regularization for fast nonconvex optimization, in: Advan ces in Neural Information Processing Systems, 2018, pp. 2899–2908

work page 2018

[67] [67]

Schmidt, R

M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Cl ifton, A. Sarkar, Non-uniform stochastic average gradient method for training conditional random ﬁelds., in: AISTA TS, 2015

work page 2015