pith. sign in

arxiv: 1906.08496 · v1 · pith:MPXX44LMnew · submitted 2019-06-20 · 💻 cs.LG · math.OC· stat.ML

Accelerating Mini-batch SARAH by Step Size Rules

Pith reviewed 2026-05-25 20:01 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML
keywords SARAHmini-batch optimizationBarzilai-Borwein step sizestochastic gradientlinear convergencestrongly convexvariance reductionstep size selection
0
0 comments X

The pith

Mini-batch SARAH with a random Barzilai-Borwein step size rule converges linearly in expectation for strongly convex functions and has better complexity than the original method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MB-SARAH-RBB, a mini-batch version of SARAH that replaces manual step-size selection with the Random Barzilai-Borwein rule for computing step sizes on the fly. It proves that this change preserves linear convergence in expectation when the objective is strongly convex. The complexity analysis shows the new method requires fewer gradient evaluations than standard mini-batch SARAH to reach a given accuracy. Experiments on common datasets confirm that MB-SARAH-RBB matches or exceeds the performance of current algorithms without any hand-tuned parameters.

Core claim

By replacing fixed or manually chosen step sizes with the Random Barzilai-Borwein (RBB) rule inside the mini-batch SARAH framework, the resulting MB-SARAH-RBB algorithm converges linearly in expectation for strongly convex objectives, achieves strictly better iteration complexity than the baseline MB-SARAH, and performs competitively on standard data sets.

What carries the argument

The Random Barzilai-Borwein (RBB) step-size rule, which generates per-iteration step lengths from two recent gradient estimates so that the linear-convergence conditions of mini-batch SARAH remain satisfied.

If this is right

  • Linear convergence in expectation holds for strongly convex objectives.
  • Complexity is strictly better than the original mini-batch SARAH.
  • The algorithm matches or beats state-of-the-art methods on standard data sets without step-size tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptive rule could be tested inside other variance-reduced recursions such as SVRG or SPIDER to remove their own step-size tuning burden.
  • Because the original SARAH framework already applies to non-convex problems, the RBB variant might be tried there even though the linear-rate proof does not cover that case.
  • The reduction in manual tuning lowers the practical barrier to deploying SARAH-style methods on new data sets where cross-validation is expensive.
  • The random choice inside the BB formula may add a mild smoothing effect that helps stability on ill-conditioned problems, an effect worth measuring separately.

Load-bearing premise

The random Barzilai-Borwein step-size rule produces values that satisfy the conditions needed for the linear convergence analysis of mini-batch SARAH to remain valid.

What would settle it

Run MB-SARAH-RBB on a simple strongly convex quadratic and observe whether the measured convergence rate is linear with the constant predicted by the theorem, or whether the step sizes occasionally violate the required bounds and cause the error to stop decreasing linearly.

Figures

Figures reproduced from arXiv: 1906.08496 by Cheng Wang, Zengping Chen, Zhuang Yang.

Figure 1
Figure 1. Figure 1: Comparison of MB-SARAH-RBB and MB-SARAH with fixed st [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of MB-SARAH-RBB and MB-SARAH with fixed st [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of MB-SARAH-RBB and MB-SARAH with fixed st [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Different initial step sizes for MB-SARAH-RBB on [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of MB-SARAH-RBB and mS2GD-RBB on [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of MB-SARAH-RBB and mS2GD-RBB on [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of different methods on three data sets: [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

StochAstic Recursive grAdient algoritHm (SARAH), originally proposed for convex optimization and also proven to be effective for general nonconvex optimization, has received great attention due to its simple recursive framework for updating stochastic gradient estimates. The performance of SARAH significantly depends on the choice of step size sequence. However, SARAH and its variants often employ a best-tuned step size by mentor, which is time consuming in practice. Motivated by this gap, we proposed a variant of the Barzilai-Borwein (BB) method, referred to as the Random Barzilai-Borwein (RBB) method, to calculate step size for SARAH in the mini-batch setting, thereby leading to a new SARAH method: MB-SARAH-RBB. We prove that MB-SARAH-RBB converges linearly in expectation for strongly convex objective functions. We analyze the complexity of MB-SARAH-RBB and show that it is better than the original method. Numerical experiments on standard data sets indicate that MB-SARAH-RBB outperforms or matches state-of-the-art algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MB-SARAH-RBB, a mini-batch variant of SARAH that replaces fixed or tuned step sizes with a Random Barzilai-Borwein (RBB) rule for automatic step-size selection. It claims to prove linear convergence in expectation for strongly convex objectives, to derive a complexity bound that improves on the original MB-SARAH, and to show competitive or superior empirical performance on standard datasets.

Significance. If the convergence claim holds, the work supplies a practical, tuning-free step-size mechanism that preserves the linear rate of mini-batch SARAH while improving practical usability; the complexity improvement and numerical results would strengthen the case for data-dependent step sizes in variance-reduced methods.

major comments (2)
  1. [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the linear-convergence proof invokes the step-size restriction required by the original MB-SARAH analysis (typically 0 < η ≤ 2μ/L or an analogous explicit interval). The manuscript does not supply a separate argument showing that every realization of the random RBB step size η_k lies inside this interval with probability 1, nor does it extend the analysis to accommodate a random η_k while controlling the additional error terms that arise. This compatibility is load-bearing for the main theorem.
  2. [§5] §5 (Complexity Analysis): the claimed complexity improvement is stated relative to the original method, yet the derivation does not explicitly recompute the iteration complexity under the random step-size distribution; it is therefore unclear whether the improvement is strict or holds only in expectation under additional assumptions on the RBB rule.
minor comments (2)
  1. [§3] Notation for the RBB step-size formula is introduced without an explicit equation number; cross-referencing would improve readability.
  2. [§6] The numerical experiments section would benefit from reporting the observed range of realized RBB step sizes to allow readers to check consistency with the theoretical interval.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the linear-convergence proof invokes the step-size restriction required by the original MB-SARAH analysis (typically 0 < η ≤ 2μ/L or an analogous explicit interval). The manuscript does not supply a separate argument showing that every realization of the random RBB step size η_k lies inside this interval with probability 1, nor does it extend the analysis to accommodate a random η_k while controlling the additional error terms that arise. This compatibility is load-bearing for the main theorem.

    Authors: We agree that the proof in the current manuscript relies on the step-size interval from the original MB-SARAH analysis without explicitly verifying that RBB realizations remain inside it almost surely. In the revised version we will insert a new lemma establishing that the random Barzilai-Borwein step sizes satisfy the required bound with probability 1 (by exploiting the positive-definiteness of the mini-batch Hessian approximations and the boundedness of the gradient differences). If this deterministic guarantee proves too restrictive, we will instead extend the convergence argument to random step sizes by controlling the additional variance terms that appear when taking expectation over η_k. revision: yes

  2. Referee: [§5] §5 (Complexity Analysis): the claimed complexity improvement is stated relative to the original method, yet the derivation does not explicitly recompute the iteration complexity under the random step-size distribution; it is therefore unclear whether the improvement is strict or holds only in expectation under additional assumptions on the RBB rule.

    Authors: The complexity claim in the manuscript is obtained by taking the expectation of the contraction factor with respect to the distribution of the RBB step sizes. To address the concern, the revised version will explicitly recompute the iteration complexity bound under this distribution, showing that the improvement holds in expectation without requiring further assumptions beyond those already stated for the RBB rule. revision: yes

Circularity Check

0 steps flagged

No circularity: convergence proof for MB-SARAH-RBB is presented as independent analysis

full rationale

The paper proposes the RBB step-size rule as a new variant and separately claims to prove linear convergence in expectation plus improved complexity for strongly convex objectives. No equations, fitted parameters, or self-citations are visible that would reduce the claimed proof to a definition, a renamed input, or an unverified self-citation chain. The derivation chain therefore remains self-contained against external benchmarks such as the original SARAH analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; the ledger is therefore limited to the single domain assumption stated in the abstract.

axioms (1)
  • domain assumption The objective function is strongly convex.
    Explicitly required for the linear convergence claim in the abstract.

pith-pipeline@v0.9.0 · 5723 in / 1134 out tokens · 42591 ms · 2026-05-25T20:01:07.047489+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 4 internal anchors

  1. [1]

    Mairal, F

    J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionar y learning for sparse coding, in: Proceedings of the 26th International Co nference on Machine Learning, ACM, 2009, pp. 689–696

  2. [2]

    T. Zhang, Solving large scale linear prediction problem s using stochastic gradient descent algorithms, in: Proceedings of the 21st In ternational Conference on Machine Learning, ACM, 2004, p. 116

  3. [3]

    P . F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ram anan, Object detection with discriminatively trained part-based model s, IEEE transac- tions on pattern analysis and machine intelligence 32 (9) (2 009) 1627– 1645

  4. [4]

    Blanz, T

    V . Blanz, T. V etter, Face recognition based on fitting a 3d morphable model, IEEE Transactions on Pattern Analysis and Machine In telligence 25 (9) (2003) 1063–1074

  5. [5]

    Hoffman, F

    M. Hoffman, F. R. Bach, D. M. Blei, Online learning for lat ent dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 856–864

  6. [6]

    Bottou, Large-scale machine learning with stochasti c gradient de- scent, in: Proceedings of COMPSTA T’2010, Springer, 2010, p p

    L. Bottou, Large-scale machine learning with stochasti c gradient de- scent, in: Proceedings of COMPSTA T’2010, Springer, 2010, p p. 177– 186

  7. [7]

    Y . Lin, F. Lv, S. Zhu, M. Y ang, T. Cour, K. Y u, L. Cao, T. Huang, Large- scale image classification: fast feature extraction and svm training, in: CVPR 2011, IEEE, 2011, pp. 1689–1696

  8. [8]

    Zhang, P

    Z. Zhang, P . Luo, C. C. Loy, X. Tang, Facial landmark detec tion by deep multi-task learning, in: European Conference on Compu ter Vision, Springer, 2014, pp. 94–108

  9. [9]

    Tao, Q.-K

    Q. Tao, Q.-K. Gao, D.-J. Chu, G.-W. Wu, Stochastic learni ng via optimizing the variational inequalities, IEEE Transactio ns on Neural Networks and Learning Systems 25 (10) (2014) 1769–1778

  10. [10]

    M. F. Bugallo, V . Elvira, L. Martino, D. Luengo, J. Migue z, P . M. Djuric, Adaptive importance sampling: the past, the present, and th e future, IEEE Signal Processing Magazine 34 (4) (2017) 60–79

  11. [11]

    C. Du, J. Zhu, B. Zhang, Learning deep generative models with doubly stochastic gradient mcmc, IEEE Transactions on Neural Netw orks and Learning Systems 29 (7) (2017) 3084–3096

  12. [12]

    Li, Preconditioned stochastic gradient descent , IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1 466

    X.-L. Li, Preconditioned stochastic gradient descent , IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1454–1 466

  13. [13]

    Moulines, F

    E. Moulines, F. R. Bach, Non-asymptotic analysis of sto chastic ap- proximation algorithms for machine learning, in: Advances in Neural Information Processing Systems, 2011, pp. 451–459

  14. [14]

    Nemirovski, A

    A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust s tochastic approximation approach to stochastic programming, SIAM Jo urnal on optimization 19 (4) (2009) 1574–1609

  15. [15]

    Bottou, F

    L. Bottou, F. E. Curtis, J. Nocedal, Optimization metho ds for large-scale machine learning, SIAM Review 60 (2) (2018) 223–311

  16. [16]

    Cotter, O

    A. Cotter, O. Shamir, N. Srebro, K. Sridharan, Better mi ni-batch algorithms via accelerated gradient methods, in: Advances in Neural Information Processing Systems, 2011, pp. 1647–1655

  17. [17]

    Koneˇ cn` y, J

    J. Koneˇ cn` y, J. Liu, P . Richt´ arik, M. Tak´ aˇ c, Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE Journal of Se lected Topics in Signal Processing 10 (2) (2016) 242–255

  18. [18]

    Needell, R

    D. Needell, R. Ward, N. Srebro, Stochastic gradient des cent, weighted sampling, and the randomized kaczmarz algorithm, in: Advan ces in Neural Information Processing Systems, 2014, pp. 1017–102 5

  19. [19]

    Csiba, P

    D. Csiba, P . Richt´ arik, Importance sampling for minibatches, The Journal of Machine Learning Research 19 (1) (2018) 962–982

  20. [20]

    T. Fu, Z. Zhang, Cpsg-mcmc: Clustering-based preproce ssing method for stochastic gradient mcmc, in: Artificial Intelligence a nd Statistics, 2017, pp. 841–850

  21. [21]

    N. L. Roux, M. Schmidt, F. R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in : Advances in Neural Information Processing Systems, 2012, pp. 2663–267 1

  22. [22]

    Defazio, F

    A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: A fast inc remental gradient method with support for non-strongly convex compo site ob- jectives, in: Advances in Neural Information Processing Sy stems, 2014, pp. 1646–1654

  23. [23]

    Shalev-Shwartz, T

    S. Shalev-Shwartz, T. Zhang, Stochastic dual coordina te ascent methods for regularized loss minimization, Journal of Machine Lear ning Research 14 (Feb) (2013) 567–599

  24. [24]

    Johnson, T

    R. Johnson, T. Zhang, Accelerating stochastic gradien t descent using predictive variance reduction, in: Advances in Neural Info rmation Pro- cessing Systems, 2013, pp. 315–323

  25. [25]

    Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp

    A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in: Advances in Neural Information Processing Systems, 2014, pp. 1574–1582

  26. [26]

    L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Sarah: A novel method for machine learning problems using stochastic recursive g radient, in: International Conference on Machine Learning-V olume 70, J MLR. org, 2017, pp. 2613–2621

  27. [27]

    C. Fang, C. J. Li, Z. Lin, T. Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differentia l estimator, in: Advances in Neural Information Processing Systems, 2018, p p. 689– 699

  28. [28]

    L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Finite-sum smooth optimization with sara h, def 1 (2019) 1

  29. [29]

    L. M. Nguyen, J. Liu, K. Scheinberg, M. Tak´ aˇ c, Stochas tic recur- sive gradient algorithm for nonconvex optimization, arXiv preprint arXiv:1705.07261

  30. [30]

    L. M. Nguyen, K. Scheinberg, M. Tak´ aˇ c, Inexact sarah a lgorithm for stochastic optimization, arXiv preprint arXiv:1811.1010 5

  31. [31]

    Nonconvex Variance Reduced Optimization with Arbitrary Sampling

    S. Horv´ ath, P . Richt´ arik, Nonconvex variance reduced optimization with arbitrary sampling, arXiv preprint arXiv:1809.04146

  32. [32]

    D. Zhou, Q. Gu, Stochastic recursive variance-reduced cubic regulariza- tion methods, arXiv preprint arXiv:1901.11518

  33. [33]

    N. H. Pham, L. M. Nguyen, D. T. Phan, Q. Tran-Dinh, ProxSA RAH: An efficient algorithmic framework for stochastic composit e nonconvex optimization, arXiv preprint arXiv:1902.05679

  34. [34]

    L. M. Nguyen, M. van Dijk, D. T. Phan, P . H. Nguyen, T.-W. W eng, J. R. Kalagnanam, Optimal finite-sum smooth non-convex opti mization with sarah, arXiv preprint arXiv:1901.07648

  35. [35]

    L. Xiao, T. Zhang, A proximal stochastic gradient metho d with progres- sive variance reduction, SIAM Journal on Optimization 24 (4 ) (2014) 2057–2075

  36. [36]

    Barzilai, J

    J. Barzilai, J. M. Borwein, Two-point step size gradien t methods, IMA Journal of Numerical Analysis 8 (1) (1988) 141–148. 10

  37. [37]

    Y ang, C

    Z. Y ang, C. Wang, Z. Zhang, J. Li, Random Barzilai-Borwe in step size for mini-batch algorithms, Engineering Applications of Artificial Intelligence 72 (2018) 124 – 135

  38. [38]

    Kesten, Accelerated stochastic approximation, Ann als of Mathemat- ical Statistics 29 (1) (1958) 41–59

    H. Kesten, Accelerated stochastic approximation, Ann als of Mathemat- ical Statistics 29 (1) (1958) 41–59

  39. [39]

    Robbins, S

    H. Robbins, S. Monro, A stochastic approximation metho d, Annals of Mathematical Statistics (1951) 400–407

  40. [40]

    Benveniste, M

    A. Benveniste, M. Mtivier, P . Priouret, Adaptive Algor ithms and Stochastic Approximations, Springer Berlin Heidelberg, 1 990

  41. [41]

    Tieleman, G

    T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide th e gradient by a running average of its recent magnitude., COURSERA: Neural Networks for Machine Learning

  42. [42]

    A. P . George, W. B. Powell, Adaptive stepsizes for recur sive estima- tion with applications in approximate dynamic programming , Machine Learning 65 (1) (2006) 167–198

  43. [43]

    Y ang, C

    Z. Y ang, C. Wang, Z. Zhang, J. Li, Mini-batch algorithms with online step size, Knowledge-Based Systems 165 (2019) 228–240

  44. [44]

    Sopyła, P

    K. Sopyła, P . Drozda, Stochastic gradient descent with barzilai–borwein update step for svm, Information Sciences 316 (2015) 218–23 3

  45. [45]

    C. Tan, S. Ma, Y . H. Dai, Y . Qian, Barzilai-Borwein step size for stochas- tic gradient descent, in: Advances in Neural Information Pr ocessing Systems, 2016, pp. 685–693

  46. [46]

    Y ang, C

    Z. Y ang, C. Wang, Y . Zang, J. Li, Mini-batch algorithms w ith Barzilai– Borwein update step, Neurocomputing 314 (2018) 177–185

  47. [47]

    Y ang, C

    Z. Y ang, C. Wang, Z. Zhang, J. Li, Accelerated stochasti c gradient descent with step size selection rules, Signal Processing 1 59 (2019) 171–186

  48. [48]

    S. De, A. Y adav, D. Jacobs, T. Goldstein, Automated infe rence with adaptive batches, in: International Conference on Artifici al Intelligence and Statistics, 2017

  49. [49]

    K. Ma, J. Zeng, J. Xiong, Q. Xu, X. Cao, W. Liu, Y . Y ao, Stoc hastic Non-convex Ordinal Embedding with Stabilized Barzilai-Bo rwein step size, in: AAAI Conference on Artificial Intelligence, 2018

  50. [50]

    Y ousefian, A

    F. Y ousefian, A. Nedi´ c, U. V . Shanbhag, On stochastic gr adient and subgradient methods with adaptive steplength sequences, A utomatica 48 (1) (2012) 56–67

  51. [51]

    Mahsereci, P

    M. Mahsereci, P . Hennig, Probabilistic line searches f or stochastic optimization, Journal of Machine Learning Research 18 (119 ) (2017) 1–59

  52. [52]

    A. G. Baydin, R. Cornish, D. M. Rubio, M. W. Schmidt, F. D. Wood, Online learning rate adaptation with hypergradient d escent, in: International Conference on Learning Representations, 20 18

  53. [53]

    S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stocha stic variance reduction for nonconvex optimization, in: International C onference on Machine Learning, 2016, pp. 314–323

  54. [54]

    L. B. Almeida, T. Langlois, J. D. Amaral, A. Plakhov, Par ameter adaptation in stochastic optimization, in: On-line learni ng in neural networks, Cambridge University Press, 1999, pp. 111–134

  55. [55]

    Schaul, S

    T. Schaul, S. Zhang, Y . Lecun, No more pesky learning rat es, in: International Conference on Machine Learning, 2013, pp. 34 3–351

  56. [56]

    Tieleman, G

    T. Tieleman, G. Hinton, Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machin e learning, Technical Report

  57. [57]

    Z. Wang, K. Crammer, S. Vucetic, Breaking the curse of ke rnelization: Budgeted stochastic gradient descent for large-scale svm t raining, Jour- nal of Machine Learning Research 13 (Oct) (2012) 3103–3131

  58. [58]

    Bhojanapalli, B

    S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optim ality of local search for low rank matrix recovery, in: Advances in Neural I nformation Processing Systems, 2016, pp. 3873–3881

  59. [59]

    Sutskever, J

    I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the impo rtance of ini- tialization and momentum in deep learning, in: Internation al Conference on Machine Learning, 2013, pp. 1139–1147

  60. [60]

    Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004

    Y . Nesterov, Introductory lectures on convex optimiza tion : basic course, Kluwer Academic, 2004

  61. [61]

    Y ang, C

    Z. Y ang, C. Wang, Z. Zhang, J. Li, Random Barzilai–Borwe in step size for mini-batch algorithms, Engineering Applications of Artificial Intelligence 72 (2018) 124–135

  62. [62]

    Bordes, L

    A. Bordes, L. Bottou, P . Gallinari, SGD-QN: Careful qua si-newton stochastic gradient descent, Journal of Machine Learning R esearch 10 (Jul) (2009) 1737–1754

  63. [63]

    R. H. Byrd, G. M. Chin, J. Nocedal, Y . Wu, Sample size sele ction in optimization methods for machine learning, Mathematical p rogramming 134 (1) (2012) 127–155

  64. [64]

    R. H. Byrd, S. L. Hansen, J. Nocedal, Y . Singer, A stochas tic quasi- newton method for large-scale optimization, SIAM Journal o n Opti- mization 26 (2) (2016) 1008–1031

  65. [65]

    Agarwal, B

    N. Agarwal, B. Bullins, E. Hazan, Second-order stochas tic optimization for machine learning in linear time, The Journal of Machine L earning Research 18 (1) (2017) 4148–4187

  66. [66]

    Tripuraneni, M

    N. Tripuraneni, M. Stern, C. Jin, J. Regier, M. I. Jordan , Stochastic cubic regularization for fast nonconvex optimization, in: Advan ces in Neural Information Processing Systems, 2018, pp. 2899–2908

  67. [67]

    Schmidt, R

    M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Cl ifton, A. Sarkar, Non-uniform stochastic average gradient method for training conditional random fields., in: AISTA TS, 2015