Random Reshuffling Dominates Stochastic Gradient Descent

Zijian Liu

arxiv: 2606.32005 · v1 · pith:OCTS5W6Tnew · submitted 2026-06-30 · 🧮 math.OC · cs.LG· stat.ML

Random Reshuffling Dominates Stochastic Gradient Descent

Zijian Liu This is my paper

Pith reviewed 2026-07-01 03:47 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML

keywords random reshufflingstochastic gradient descentsmooth convex optimizationfinite-sum problemsconvergence ratesshuffling SGD

0 comments

The pith

Random reshuffling dominates stochastic gradient descent in smooth convex finite-sum optimization under any reasonable stepsize after any finite number of epochs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that random reshuffling (RR), which randomly permutes the data once per epoch before taking gradients, achieves strictly better convergence guarantees than standard stochastic gradient descent (SGD) for smooth convex problems written as a finite sum. Earlier analyses required the stepsize to be at most order 1/n and the number of epochs to exceed order n before RR could be shown superior; the new result removes both restrictions. A sympathetic reader would therefore expect RR to be the default choice in practice whenever the objective meets the smoothness and convexity conditions, with no theoretical penalty for using larger stepsizes or running only a few passes over the data.

Core claim

For the first time, random reshuffling dominates SGD in smooth convex optimization under any reasonable stepsize after any finite number of epochs.

What carries the argument

The new comparison argument that directly bounds the expected suboptimality of RR against that of SGD for arbitrary reasonable stepsizes, without invoking the prior 1/n threshold.

If this is right

RR's optimally tuned rate is no longer strictly worse than SGD's when the number of epochs is smaller than order n.
Theoretical support now exists for the stepsize regimes and short training runs that are common in practice.
Shuffling SGD under RR can be used without the previous theoretical caveats that limited its applicability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dominance relation may hold for other permutation-based variants of SGD, though the paper does not prove it.
If the result extends to non-convex problems, it would further narrow the gap between theory and the widespread use of shuffling in deep learning.
The proof technique might be adapted to compare RR against other variance-reduced methods under similar relaxed stepsize conditions.

Load-bearing premise

The objective must be a finite sum of smooth convex component functions, and the stepsize must satisfy the paper's definition of 'reasonable' that drops the old 1/n upper bound.

What would settle it

A concrete smooth convex finite-sum instance together with a stepsize larger than order 1/n where, after a small number of epochs, the expected error of RR exceeds that of SGD.

read the original abstract

Stochastic Gradient Descent ($\textsf{SGD}$) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of $\textsf{SGD}$ differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ($\textsf{Shuffling SGD}$). A particularly popular strategy in $\textsf{Shuffling SGD}$ is Random Reshuffling ($\textsf{RR}$), which has achieved great empirical success across numerous experiments. Despite its strong performance, $\textsf{RR}$ has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for $\textsf{RR}$, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of $\textsf{RR}$ remain to this day. More precisely, according to the current theory, $\textsf{Shuffling SGD}$ under $\textsf{RR}$ converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of $\textsf{Shuffling SGD}$ under $\textsf{RR}$ is strictly worse than that of $\textsf{SGD}$ when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that $\textsf{RR}$ dominates $\textsf{SGD}$ in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract claims a clean proof that RR dominates SGD without the old stepsize and epoch restrictions, but the full argument is not visible so the result stays unverified.

read the letter

The paper's central claim is that it proves random reshuffling dominates SGD for smooth convex finite-sum problems under any reasonable stepsize after any finite number of epochs. This would remove the two main restrictions that kept earlier RR theory from matching practice.

It does a solid job stating the problem: prior bounds required stepsize below O(1/n) and more than O(n) epochs before RR could be shown better than plain SGD. Naming those exact limitations is useful.

The soft spot is obvious and large. Only the abstract is supplied. No theorem statement, no definition of reasonable stepsize, and no proof steps are available. Without those pieces it is impossible to check whether the dominance actually follows or whether the new bound hides extra assumptions. The circularity burden and any post-hoc choices also cannot be assessed.

This work is aimed at people who study convergence rates for shuffling-based SGD. A reader who needs to know whether RR is now theoretically justified in the stepsize and epoch ranges that are actually used would find it relevant if the proof holds. The question itself is real and the claim is sharp enough to be checked.

I would send the full manuscript to a serious referee. The topic matters and the stated result is specific, but the current version cannot be evaluated on its own.

Referee Report

0 major / 3 minor

Summary. The paper proves that random reshuffling (RR) dominates stochastic gradient descent (SGD) for smooth convex finite-sum optimization under any reasonable stepsize after any finite number of epochs. It removes the prior O(1/n) stepsize restriction and the requirement that the number of epochs exceed a threshold proportional to n for the theoretical rate of RR to match or exceed that of SGD.

Significance. If the main theorem holds, the result supplies the first proof that RR is strictly superior to SGD without artificial stepsize upper bounds, directly addressing a decade-old gap between theory and the observed practical performance of shuffling-based methods. The removal of the 1/n restriction and the finite-epoch guarantee constitute a substantive advance for the analysis of variance-reduced and shuffling SGD variants.

minor comments (3)

§2, Definition 2.1: the phrase 'reasonable stepsize' is introduced without an explicit inequality; a displayed condition (e.g., η ≤ c / L for some absolute c) would improve readability.
Theorem 3.1: the statement claims dominance 'after any finite number of epochs'; the proof sketch in §4 should explicitly verify the base case k=1 to avoid any ambiguity about the induction start.
Figure 1: the y-axis label 'excess risk' is not defined in the caption; add a parenthetical reference to the quantity appearing in Eq. (8).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive report and the recommendation to accept. The summary accurately captures the contributions of the work.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript presents a theoretical proof that RR dominates SGD for smooth convex finite-sum problems under a redefined 'reasonable stepsize' after any finite number of epochs. The abstract and available description frame this as resolving prior limitations (stepsize bound proportional to 1/n) via new analysis, without any quoted equations, fitted parameters, or self-citations that reduce the main claim to its inputs by construction. No self-definitional, fitted-input, or load-bearing self-citation patterns are exhibited in the provided material, so the derivation chain is treated as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, invented entities, or non-standard axioms beyond the standard setting of smooth convex finite-sum optimization.

pith-pipeline@v0.9.1-grok · 5847 in / 965 out tokens · 46845 ms · 2026-07-01T03:47:22.404739+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 8 canonical work pages

[1]

Sgd with shuffling: optimal rates without component convexity and large epoch requirements

Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17526--17535. Curran Associates, Inc., 2020. URL https://proceedings.neurips.c...

2020
[2]

Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478

Yoshua Bengio. Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_26. URL https://doi.org/10.1007/978-3-642-35289-8_26

work page doi:10.1007/978-3-642-35289-8_26 2012
[3]

Curiously fast convergence of some stochastic gradient descent algorithms

L \'e on Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, Paris, volume 8, pages 2624--2633. Citeseer, 2009

2009
[4]

Stochastic Gradient Descent Tricks, pages 421--436

L \'e on Bottou. Stochastic Gradient Descent Tricks, pages 421--436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_25. URL https://doi.org/10.1007/978-3-642-35289-8_25

work page doi:10.1007/978-3-642-35289-8_25 2012
[5]

Curtis, and Jorge Nocedal

L\' e on Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60 0 (2): 0 223--311, 2018. doi:10.1137/16M1080173. URL https://doi.org/10.1137/16M1080173

work page doi:10.1137/16m1080173 2018
[6]

Last iterate convergence of incremental methods as a model of forgetting

Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods as a model of forgetting. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 102613--102647, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/fea9f93f4cec99f65a8b4d575fc353a8-Pa...

2025
[7]

Tighter convergence bounds for shuffled sgd via primal-dual perspective

Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Tighter convergence bounds for shuffled sgd via primal-dual perspective. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 72475--72524. Curran Associates, Inc., 2024. doi:10.52202/079017-2310...

work page doi:10.52202/079017-2310 2024
[8]

Tighter lower bounds for shuffling SGD : Random permutations and beyond

Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD : Random permutations and beyond. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research...

2023
[9]

Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023

work page arXiv 2023
[10]

Why random reshuffling beats stochastic gradient descent

Mert G \"u rb \"u zbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186: 0 49--84, 2021

2021
[11]

Random shuffling beats SGD after finite epochs

Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2624--2633. PMLR, 09--15 Jun 2019. URL https://proceedings.mlr.press/v97/haochen19a.html

2019
[12]

Decomposition into functions in the minimization problem

Vladimir Kibardin. Decomposition into functions in the minimization problem. Automation and Remote Control, 1979, 01 1979

1979
[13]

Benign underfitting of stochastic gradient descent

Tomer Koren, Roi Livni, Yishay Mansour, and Uri Sherman. Benign underfitting of stochastic gradient descent. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 19605--19617. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/202...

2022
[14]

First-order and stochastic optimization methods for machine learning

Guanghui Lan. First-order and stochastic optimization methods for machine learning. Springer, 2020

2020
[15]

On the last-iterate convergence of shuffling gradient methods

Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pa...

2024
[16]

Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization

Zijian Liu and Zhengyuan Zhou. Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceed...

2025
[17]

Random reshuffling: Simple analysis with vast improvements

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtarik. Random reshuffling: Simple analysis with vast improvements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17309--17320. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/pap...

2020
[18]

SGD without replacement: Sharper rates for general smooth convex functions

Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4703--4711. PMLR, 09--15 Jun 2019. UR...

2019
[19]

Bertsekas

Angelia Nedic and Dimitri P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12 0 (1): 0 109--138, 2001. doi:10.1137/S1052623499362111. URL https://doi.org/10.1137/S1052623499362111

work page doi:10.1137/s1052623499362111 2001
[20]

Lectures on convex optimization, volume 137

Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018

2018
[21]

Nguyen, Quoc Tran-Dinh, Dzung T

Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. A unified convergence analysis for shuffling-type gradient methods. Journal of Machine Learning Research, 22 0 (207): 0 1--44, 2021. URL http://jmlr.org/papers/v22/20-1238.html

2021
[22]

Boris T. Polyak. Introduction to optimization. New York, Optimization Software, 1987

1987
[23]

Closing the convergence gap of SGD without replacement

Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7964--7973. PMLR, 13--18 Jul 2020. URL https://proceedings.mlr.pres...

2020
[24]

Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022

Shashank Rajput, Kangwook Lee, and Dimitris Papailiopoulos. Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=YiBa9HKTyXE

2022
[25]

On Information and Sufficiency,

Herbert Robbins and Sutton Monro. A Stochastic Approximation Method . The Annals of Mathematical Statistics, 22 0 (3): 0 400 -- 407, 1951. doi:10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586

work page doi:10.1214/aoms/1177729586 1951
[26]

Itay Safran and Ohad Shamir. How good is sgd with random shuffling? In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3250--3284. PMLR, 09--12 Jul 2020. URL https://proceedings.mlr.press/v125/safran20a.html

2020
[27]

Random shuffling beats sgd only after many epochs on ill-conditioned problems

Itay Safran and Ohad Shamir. Random shuffling beats sgd only after many epochs on ill-conditioned problems. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15151--15161. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/p...

2021
[28]

Without-replacement sampling for stochastic gradient methods

Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf

2016
[29]

Optimal rates for random order online optimization

Uri Sherman, Tomer Koren, and Yishay Mansour. Optimal rates for random order online optimization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2097--2108. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/fi...

2097
[30]

Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H. Sayed. Stochastic learning under random reshuffling with constant step-sizes. IEEE Transactions on Signal Processing, 67 0 (2): 0 474--489, 2019. doi:10.1109/TSP.2018.2878551

work page doi:10.1109/tsp.2018.2878551 2019

[1] [1]

Sgd with shuffling: optimal rates without component convexity and large epoch requirements

Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17526--17535. Curran Associates, Inc., 2020. URL https://proceedings.neurips.c...

2020

[2] [2]

Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478

Yoshua Bengio. Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_26. URL https://doi.org/10.1007/978-3-642-35289-8_26

work page doi:10.1007/978-3-642-35289-8_26 2012

[3] [3]

Curiously fast convergence of some stochastic gradient descent algorithms

L \'e on Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, Paris, volume 8, pages 2624--2633. Citeseer, 2009

2009

[4] [4]

Stochastic Gradient Descent Tricks, pages 421--436

L \'e on Bottou. Stochastic Gradient Descent Tricks, pages 421--436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_25. URL https://doi.org/10.1007/978-3-642-35289-8_25

work page doi:10.1007/978-3-642-35289-8_25 2012

[5] [5]

Curtis, and Jorge Nocedal

L\' e on Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60 0 (2): 0 223--311, 2018. doi:10.1137/16M1080173. URL https://doi.org/10.1137/16M1080173

work page doi:10.1137/16m1080173 2018

[6] [6]

Last iterate convergence of incremental methods as a model of forgetting

Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods as a model of forgetting. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 102613--102647, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/fea9f93f4cec99f65a8b4d575fc353a8-Pa...

2025

[7] [7]

Tighter convergence bounds for shuffled sgd via primal-dual perspective

Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Tighter convergence bounds for shuffled sgd via primal-dual perspective. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 72475--72524. Curran Associates, Inc., 2024. doi:10.52202/079017-2310...

work page doi:10.52202/079017-2310 2024

[8] [8]

Tighter lower bounds for shuffling SGD : Random permutations and beyond

Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD : Random permutations and beyond. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research...

2023

[9] [9]

Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023

work page arXiv 2023

[10] [10]

Why random reshuffling beats stochastic gradient descent

Mert G \"u rb \"u zbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186: 0 49--84, 2021

2021

[11] [11]

Random shuffling beats SGD after finite epochs

Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2624--2633. PMLR, 09--15 Jun 2019. URL https://proceedings.mlr.press/v97/haochen19a.html

2019

[12] [12]

Decomposition into functions in the minimization problem

Vladimir Kibardin. Decomposition into functions in the minimization problem. Automation and Remote Control, 1979, 01 1979

1979

[13] [13]

Benign underfitting of stochastic gradient descent

Tomer Koren, Roi Livni, Yishay Mansour, and Uri Sherman. Benign underfitting of stochastic gradient descent. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 19605--19617. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/202...

2022

[14] [14]

First-order and stochastic optimization methods for machine learning

Guanghui Lan. First-order and stochastic optimization methods for machine learning. Springer, 2020

2020

[15] [15]

On the last-iterate convergence of shuffling gradient methods

Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pa...

2024

[16] [16]

Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization

Zijian Liu and Zhengyuan Zhou. Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceed...

2025

[17] [17]

Random reshuffling: Simple analysis with vast improvements

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtarik. Random reshuffling: Simple analysis with vast improvements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17309--17320. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/pap...

2020

[18] [18]

SGD without replacement: Sharper rates for general smooth convex functions

Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4703--4711. PMLR, 09--15 Jun 2019. UR...

2019

[19] [19]

Bertsekas

Angelia Nedic and Dimitri P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12 0 (1): 0 109--138, 2001. doi:10.1137/S1052623499362111. URL https://doi.org/10.1137/S1052623499362111

work page doi:10.1137/s1052623499362111 2001

[20] [20]

Lectures on convex optimization, volume 137

Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018

2018

[21] [21]

Nguyen, Quoc Tran-Dinh, Dzung T

Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. A unified convergence analysis for shuffling-type gradient methods. Journal of Machine Learning Research, 22 0 (207): 0 1--44, 2021. URL http://jmlr.org/papers/v22/20-1238.html

2021

[22] [22]

Boris T. Polyak. Introduction to optimization. New York, Optimization Software, 1987

1987

[23] [23]

Closing the convergence gap of SGD without replacement

Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7964--7973. PMLR, 13--18 Jul 2020. URL https://proceedings.mlr.pres...

2020

[24] [24]

Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022

Shashank Rajput, Kangwook Lee, and Dimitris Papailiopoulos. Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=YiBa9HKTyXE

2022

[25] [25]

On Information and Sufficiency,

Herbert Robbins and Sutton Monro. A Stochastic Approximation Method . The Annals of Mathematical Statistics, 22 0 (3): 0 400 -- 407, 1951. doi:10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586

work page doi:10.1214/aoms/1177729586 1951

[26] [26]

Itay Safran and Ohad Shamir. How good is sgd with random shuffling? In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3250--3284. PMLR, 09--12 Jul 2020. URL https://proceedings.mlr.press/v125/safran20a.html

2020

[27] [27]

Random shuffling beats sgd only after many epochs on ill-conditioned problems

Itay Safran and Ohad Shamir. Random shuffling beats sgd only after many epochs on ill-conditioned problems. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15151--15161. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/p...

2021

[28] [28]

Without-replacement sampling for stochastic gradient methods

Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf

2016

[29] [29]

Optimal rates for random order online optimization

Uri Sherman, Tomer Koren, and Yishay Mansour. Optimal rates for random order online optimization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2097--2108. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/fi...

2097

[30] [30]

Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H. Sayed. Stochastic learning under random reshuffling with constant step-sizes. IEEE Transactions on Signal Processing, 67 0 (2): 0 474--489, 2019. doi:10.1109/TSP.2018.2878551

work page doi:10.1109/tsp.2018.2878551 2019