Accelerating Mini-batch SARAH by Step Size Rules
Pith reviewed 2026-05-25 20:01 UTC · model grok-4.3
The pith
Mini-batch SARAH with a random Barzilai-Borwein step size rule converges linearly in expectation for strongly convex functions and has better complexity than the original method.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing fixed or manually chosen step sizes with the Random Barzilai-Borwein (RBB) rule inside the mini-batch SARAH framework, the resulting MB-SARAH-RBB algorithm converges linearly in expectation for strongly convex objectives, achieves strictly better iteration complexity than the baseline MB-SARAH, and performs competitively on standard data sets.
What carries the argument
The Random Barzilai-Borwein (RBB) step-size rule, which generates per-iteration step lengths from two recent gradient estimates so that the linear-convergence conditions of mini-batch SARAH remain satisfied.
If this is right
- Linear convergence in expectation holds for strongly convex objectives.
- Complexity is strictly better than the original mini-batch SARAH.
- The algorithm matches or beats state-of-the-art methods on standard data sets without step-size tuning.
Where Pith is reading between the lines
- The same adaptive rule could be tested inside other variance-reduced recursions such as SVRG or SPIDER to remove their own step-size tuning burden.
- Because the original SARAH framework already applies to non-convex problems, the RBB variant might be tried there even though the linear-rate proof does not cover that case.
- The reduction in manual tuning lowers the practical barrier to deploying SARAH-style methods on new data sets where cross-validation is expensive.
- The random choice inside the BB formula may add a mild smoothing effect that helps stability on ill-conditioned problems, an effect worth measuring separately.
Load-bearing premise
The random Barzilai-Borwein step-size rule produces values that satisfy the conditions needed for the linear convergence analysis of mini-batch SARAH to remain valid.
What would settle it
Run MB-SARAH-RBB on a simple strongly convex quadratic and observe whether the measured convergence rate is linear with the constant predicted by the theorem, or whether the step sizes occasionally violate the required bounds and cause the error to stop decreasing linearly.
Figures
read the original abstract
StochAstic Recursive grAdient algoritHm (SARAH), originally proposed for convex optimization and also proven to be effective for general nonconvex optimization, has received great attention due to its simple recursive framework for updating stochastic gradient estimates. The performance of SARAH significantly depends on the choice of step size sequence. However, SARAH and its variants often employ a best-tuned step size by mentor, which is time consuming in practice. Motivated by this gap, we proposed a variant of the Barzilai-Borwein (BB) method, referred to as the Random Barzilai-Borwein (RBB) method, to calculate step size for SARAH in the mini-batch setting, thereby leading to a new SARAH method: MB-SARAH-RBB. We prove that MB-SARAH-RBB converges linearly in expectation for strongly convex objective functions. We analyze the complexity of MB-SARAH-RBB and show that it is better than the original method. Numerical experiments on standard data sets indicate that MB-SARAH-RBB outperforms or matches state-of-the-art algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MB-SARAH-RBB, a mini-batch variant of SARAH that replaces fixed or tuned step sizes with a Random Barzilai-Borwein (RBB) rule for automatic step-size selection. It claims to prove linear convergence in expectation for strongly convex objectives, to derive a complexity bound that improves on the original MB-SARAH, and to show competitive or superior empirical performance on standard datasets.
Significance. If the convergence claim holds, the work supplies a practical, tuning-free step-size mechanism that preserves the linear rate of mini-batch SARAH while improving practical usability; the complexity improvement and numerical results would strengthen the case for data-dependent step sizes in variance-reduced methods.
major comments (2)
- [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the linear-convergence proof invokes the step-size restriction required by the original MB-SARAH analysis (typically 0 < η ≤ 2μ/L or an analogous explicit interval). The manuscript does not supply a separate argument showing that every realization of the random RBB step size η_k lies inside this interval with probability 1, nor does it extend the analysis to accommodate a random η_k while controlling the additional error terms that arise. This compatibility is load-bearing for the main theorem.
- [§5] §5 (Complexity Analysis): the claimed complexity improvement is stated relative to the original method, yet the derivation does not explicitly recompute the iteration complexity under the random step-size distribution; it is therefore unclear whether the improvement is strict or holds only in expectation under additional assumptions on the RBB rule.
minor comments (2)
- [§3] Notation for the RBB step-size formula is introduced without an explicit equation number; cross-referencing would improve readability.
- [§6] The numerical experiments section would benefit from reporting the observed range of realized RBB step sizes to allow readers to check consistency with the theoretical interval.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the linear-convergence proof invokes the step-size restriction required by the original MB-SARAH analysis (typically 0 < η ≤ 2μ/L or an analogous explicit interval). The manuscript does not supply a separate argument showing that every realization of the random RBB step size η_k lies inside this interval with probability 1, nor does it extend the analysis to accommodate a random η_k while controlling the additional error terms that arise. This compatibility is load-bearing for the main theorem.
Authors: We agree that the proof in the current manuscript relies on the step-size interval from the original MB-SARAH analysis without explicitly verifying that RBB realizations remain inside it almost surely. In the revised version we will insert a new lemma establishing that the random Barzilai-Borwein step sizes satisfy the required bound with probability 1 (by exploiting the positive-definiteness of the mini-batch Hessian approximations and the boundedness of the gradient differences). If this deterministic guarantee proves too restrictive, we will instead extend the convergence argument to random step sizes by controlling the additional variance terms that appear when taking expectation over η_k. revision: yes
-
Referee: [§5] §5 (Complexity Analysis): the claimed complexity improvement is stated relative to the original method, yet the derivation does not explicitly recompute the iteration complexity under the random step-size distribution; it is therefore unclear whether the improvement is strict or holds only in expectation under additional assumptions on the RBB rule.
Authors: The complexity claim in the manuscript is obtained by taking the expectation of the contraction factor with respect to the distribution of the RBB step sizes. To address the concern, the revised version will explicitly recompute the iteration complexity bound under this distribution, showing that the improvement holds in expectation without requiring further assumptions beyond those already stated for the RBB rule. revision: yes
Circularity Check
No circularity: convergence proof for MB-SARAH-RBB is presented as independent analysis
full rationale
The paper proposes the RBB step-size rule as a new variant and separately claims to prove linear convergence in expectation plus improved complexity for strongly convex objectives. No equations, fitted parameters, or self-citations are visible that would reduce the claimed proof to a definition, a renamed input, or an unverified self-citation chain. The derivation chain therefore remains self-contained against external benchmarks such as the original SARAH analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The objective function is strongly convex.
Reference graph
Works this paper leans on
- [1]
-
[2]
T. Zhang, Solving large scale linear prediction problem s using stochastic gradient descent algorithms, in: Proceedings of the 21st In ternational Conference on Machine Learning, ACM, 2004, p. 116
work page 2004
-
[3]
P . F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ram anan, Object detection with discriminatively trained part-based model s, IEEE transac- tions on pattern analysis and machine intelligence 32 (9) (2 009) 1627– 1645
- [4]
-
[5]
M. Hoffman, F. R. Bach, D. M. Blei, Online learning for lat ent dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 856–864
work page 2010
-
[6]
L. Bottou, Large-scale machine learning with stochasti c gradient de- scent, in: Proceedings of COMPSTA T’2010, Springer, 2010, p p. 177– 186
work page 2010
-
[7]
Y . Lin, F. Lv, S. Zhu, M. Y ang, T. Cour, K. Y u, L. Cao, T. Huang, Large- scale image classification: fast feature extraction and svm training, in: CVPR 2011, IEEE, 2011, pp. 1689–1696
work page 2011
- [8]
- [9]
-
[10]
M. F. Bugallo, V . Elvira, L. Martino, D. Luengo, J. Migue z, P . M. Djuric, Adaptive importance sampling: the past, the present, and th e future, IEEE Signal Processing Magazine 34 (4) (2017) 60–79
work page 2017
-
[11]
C. Du, J. Zhu, B. Zhang, Learning deep generative models with doubly stochastic gradient mcmc, IEEE Transactions on Neural Netw orks and Learning Systems 29 (7) (2017) 3084–3096
work page 2017
-
[12]
X.-L. Li, Preconditioned stochastic gradient descent , IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1 466
work page 2018
-
[13]
E. Moulines, F. R. Bach, Non-asymptotic analysis of sto chastic ap- proximation algorithms for machine learning, in: Advances in Neural Information Processing Systems, 2011, pp. 451–459
work page 2011
-
[14]
A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust s tochastic approximation approach to stochastic programming, SIAM Jo urnal on optimization 19 (4) (2009) 1574–1609
work page 2009
- [15]
- [16]
-
[17]
J. Koneˇ cn` y, J. Liu, P . Richt´ arik, M. Tak´ aˇ c, Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE Journal of Se lected Topics in Signal Processing 10 (2) (2016) 242–255
work page 2016
-
[18]
D. Needell, R. Ward, N. Srebro, Stochastic gradient des cent, weighted sampling, and the randomized kaczmarz algorithm, in: Advan ces in Neural Information Processing Systems, 2014, pp. 1017–102 5
work page 2014
- [19]
-
[20]
T. Fu, Z. Zhang, Cpsg-mcmc: Clustering-based preproce ssing method for stochastic gradient mcmc, in: Artificial Intelligence a nd Statistics, 2017, pp. 841–850
work page 2017
-
[21]
N. L. Roux, M. Schmidt, F. R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in : Advances in Neural Information Processing Systems, 2012, pp. 2663–267 1
work page 2012
-
[22]
A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: A fast inc remental gradient method with support for non-strongly convex compo site ob- jectives, in: Advances in Neural Information Processing Sy stems, 2014, pp. 1646–1654
work page 2014
-
[23]
S. Shalev-Shwartz, T. Zhang, Stochastic dual coordina te ascent methods for regularized loss minimization, Journal of Machine Lear ning Research 14 (Feb) (2013) 567–599
work page 2013
-
[24]
R. Johnson, T. Zhang, Accelerating stochastic gradien t descent using predictive variance reduction, in: Advances in Neural Info rmation Pro- cessing Systems, 2013, pp. 315–323
work page 2013
-
[25]
A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp. 1574–1582
work page 2014
-
[26]
L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Sarah: A novel method for machine learning problems using stochastic recursive g radient, in: International Conference on Machine Learning-V olume 70, J MLR. org, 2017, pp. 2613–2621
work page 2017
-
[27]
C. Fang, C. J. Li, Z. Lin, T. Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differentia l estimator, in: Advances in Neural Information Processing Systems, 2018, p p. 689– 699
work page 2018
-
[28]
L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Finite-sum smooth optimization with sara h, def 1 (2019) 1
work page 2019
-
[29]
L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Stochas tic recur- sive gradient algorithm for nonconvex optimization, arXiv preprint arXiv:1705.07261
work page internal anchor Pith review Pith/arXiv arXiv
- [30]
-
[31]
Nonconvex Variance Reduced Optimization with Arbitrary Sampling
S. Horv´ ath, P . Richt´ arik, Nonconvex variance reduced optimization with arbitrary sampling, arXiv preprint arXiv:1809.04146
work page internal anchor Pith review Pith/arXiv arXiv
- [32]
-
[33]
N. H. Pham, L. M. Nguyen, D. T. Phan, Q. Tran-Dinh, ProxSA RAH: An efficient algorithmic framework for stochastic composit e nonconvex optimization, arXiv preprint arXiv:1902.05679
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[34]
L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Optimal finite-sum smooth non-convex opti mization with sarah, arXiv preprint arXiv:1901.07648
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[35]
L. Xiao, T. Zhang, A proximal stochastic gradient metho d with progres- sive variance reduction, SIAM Journal on Optimization 24 (4 ) (2014) 2057–2075
work page 2014
-
[36]
J. Barzilai, J. M. Borwein, Two-point step size gradien t methods, IMA Journal of Numerical Analysis 8 (1) (1988) 141–148. 10
work page 1988
- [37]
-
[38]
H. Kesten, Accelerated stochastic approximation, Ann als of Mathemat- ical Statistics 29 (1) (1958) 41–59
work page 1958
-
[39]
H. Robbins, S. Monro, A stochastic approximation metho d, Annals of Mathematical Statistics (1951) 400–407
work page 1951
-
[40]
A. Benveniste, M. Mtivier, P . Priouret, Adaptive Algor ithms and Stochastic Approximations, Springer Berlin Heidelberg, 1 990
-
[41]
T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide th e gradient by a running average of its recent magnitude., COURSERA: Neural Networks for Machine Learning
-
[42]
A. P . George, W. B. Powell, Adaptive stepsizes for recur sive estima- tion with applications in approximate dynamic programming , Machine Learning 65 (1) (2006) 167–198
work page 2006
- [43]
- [44]
-
[45]
C. Tan, S. Ma, Y . H. Dai, Y . Qian, Barzilai-Borwein step size for stochas- tic gradient descent, in: Advances in Neural Information Pr ocessing Systems, 2016, pp. 685–693
work page 2016
- [46]
- [47]
-
[48]
S. De, A. Y adav, D. Jacobs, T. Goldstein, Automated infe rence with adaptive batches, in: International Conference on Artifici al Intelligence and Statistics, 2017
work page 2017
-
[49]
K. Ma, J. Zeng, J. Xiong, Q. Xu, X. Cao, W. Liu, Y . Y ao, Stoc hastic Non-convex Ordinal Embedding with Stabilized Barzilai-Bo rwein step size, in: AAAI Conference on Artificial Intelligence, 2018
work page 2018
-
[50]
F. Y ousefian, A. Nedi´ c, U. V . Shanbhag, On stochastic gr adient and subgradient methods with adaptive steplength sequences, A utomatica 48 (1) (2012) 56–67
work page 2012
-
[51]
M. Mahsereci, P . Hennig, Probabilistic line searches f or stochastic optimization, Journal of Machine Learning Research 18 (119 ) (2017) 1–59
work page 2017
-
[52]
A. G. Baydin, R. Cornish, D. M. Rubio, M. W. Schmidt, F. D. Wood, Online learning rate adaptation with hypergradient d escent, in: International Conference on Learning Representations, 20 18
-
[53]
S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stocha stic variance reduction for nonconvex optimization, in: International C onference on Machine Learning, 2016, pp. 314–323
work page 2016
-
[54]
L. B. Almeida, T. Langlois, J. D. Amaral, A. Plakhov, Par ameter adaptation in stochastic optimization, in: On-line learni ng in neural networks, Cambridge University Press, 1999, pp. 111–134
work page 1999
- [55]
-
[56]
T. Tieleman, G. Hinton, Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machin e learning, Technical Report
-
[57]
Z. Wang, K. Crammer, S. Vucetic, Breaking the curse of ke rnelization: Budgeted stochastic gradient descent for large-scale svm t raining, Jour- nal of Machine Learning Research 13 (Oct) (2012) 3103–3131
work page 2012
-
[58]
S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optim ality of local search for low rank matrix recovery, in: Advances in Neural I nformation Processing Systems, 2016, pp. 3873–3881
work page 2016
-
[59]
I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the impo rtance of ini- tialization and momentum in deep learning, in: Internation al Conference on Machine Learning, 2013, pp. 1139–1147
work page 2013
-
[60]
Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004
Y . Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004
work page 2004
- [61]
- [62]
-
[63]
R. H. Byrd, G. M. Chin, J. Nocedal, Y . Wu, Sample size sele ction in optimization methods for machine learning, Mathematical p rogramming 134 (1) (2012) 127–155
work page 2012
-
[64]
R. H. Byrd, S. L. Hansen, J. Nocedal, Y . Singer, A stochas tic quasi- newton method for large-scale optimization, SIAM Journal o n Opti- mization 26 (2) (2016) 1008–1031
work page 2016
-
[65]
N. Agarwal, B. Bullins, E. Hazan, Second-order stochas tic optimization for machine learning in linear time, The Journal of Machine L earning Research 18 (1) (2017) 4148–4187
work page 2017
-
[66]
N. Tripuraneni, M. Stern, C. Jin, J. Regier, M. I. Jordan , Stochastic cubic regularization for fast nonconvex optimization, in: Advan ces in Neural Information Processing Systems, 2018, pp. 2899–2908
work page 2018
-
[67]
M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Cl ifton, A. Sarkar, Non-uniform stochastic average gradient method for training conditional random fields., in: AISTA TS, 2015
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.