Random Reshuffling Dominates Stochastic Gradient Descent
Pith reviewed 2026-07-01 03:47 UTC · model grok-4.3
The pith
Random reshuffling dominates stochastic gradient descent in smooth convex finite-sum optimization under any reasonable stepsize after any finite number of epochs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the first time, random reshuffling dominates SGD in smooth convex optimization under any reasonable stepsize after any finite number of epochs.
What carries the argument
The new comparison argument that directly bounds the expected suboptimality of RR against that of SGD for arbitrary reasonable stepsizes, without invoking the prior 1/n threshold.
If this is right
- RR's optimally tuned rate is no longer strictly worse than SGD's when the number of epochs is smaller than order n.
- Theoretical support now exists for the stepsize regimes and short training runs that are common in practice.
- Shuffling SGD under RR can be used without the previous theoretical caveats that limited its applicability.
Where Pith is reading between the lines
- The same dominance relation may hold for other permutation-based variants of SGD, though the paper does not prove it.
- If the result extends to non-convex problems, it would further narrow the gap between theory and the widespread use of shuffling in deep learning.
- The proof technique might be adapted to compare RR against other variance-reduced methods under similar relaxed stepsize conditions.
Load-bearing premise
The objective must be a finite sum of smooth convex component functions, and the stepsize must satisfy the paper's definition of 'reasonable' that drops the old 1/n upper bound.
What would settle it
A concrete smooth convex finite-sum instance together with a stepsize larger than order 1/n where, after a small number of epochs, the expected error of RR exceeds that of SGD.
read the original abstract
Stochastic Gradient Descent ($\textsf{SGD}$) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of $\textsf{SGD}$ differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ($\textsf{Shuffling SGD}$). A particularly popular strategy in $\textsf{Shuffling SGD}$ is Random Reshuffling ($\textsf{RR}$), which has achieved great empirical success across numerous experiments. Despite its strong performance, $\textsf{RR}$ has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for $\textsf{RR}$, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of $\textsf{RR}$ remain to this day. More precisely, according to the current theory, $\textsf{Shuffling SGD}$ under $\textsf{RR}$ converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of $\textsf{Shuffling SGD}$ under $\textsf{RR}$ is strictly worse than that of $\textsf{SGD}$ when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that $\textsf{RR}$ dominates $\textsf{SGD}$ in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that random reshuffling (RR) dominates stochastic gradient descent (SGD) for smooth convex finite-sum optimization under any reasonable stepsize after any finite number of epochs. It removes the prior O(1/n) stepsize restriction and the requirement that the number of epochs exceed a threshold proportional to n for the theoretical rate of RR to match or exceed that of SGD.
Significance. If the main theorem holds, the result supplies the first proof that RR is strictly superior to SGD without artificial stepsize upper bounds, directly addressing a decade-old gap between theory and the observed practical performance of shuffling-based methods. The removal of the 1/n restriction and the finite-epoch guarantee constitute a substantive advance for the analysis of variance-reduced and shuffling SGD variants.
minor comments (3)
- §2, Definition 2.1: the phrase 'reasonable stepsize' is introduced without an explicit inequality; a displayed condition (e.g., η ≤ c / L for some absolute c) would improve readability.
- Theorem 3.1: the statement claims dominance 'after any finite number of epochs'; the proof sketch in §4 should explicitly verify the base case k=1 to avoid any ambiguity about the induction start.
- Figure 1: the y-axis label 'excess risk' is not defined in the caption; add a parenthetical reference to the quantity appearing in Eq. (8).
Simulated Author's Rebuttal
We thank the referee for the positive report and the recommendation to accept. The summary accurately captures the contributions of the work.
Circularity Check
No significant circularity identified
full rationale
The manuscript presents a theoretical proof that RR dominates SGD for smooth convex finite-sum problems under a redefined 'reasonable stepsize' after any finite number of epochs. The abstract and available description frame this as resolving prior limitations (stepsize bound proportional to 1/n) via new analysis, without any quoted equations, fitted parameters, or self-citations that reduce the main claim to its inputs by construction. No self-definitional, fitted-input, or load-bearing self-citation patterns are exhibited in the provided material, so the derivation chain is treated as self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Sgd with shuffling: optimal rates without component convexity and large epoch requirements
Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17526--17535. Curran Associates, Inc., 2020. URL https://proceedings.neurips.c...
2020
-
[2]
Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478
Yoshua Bengio. Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_26. URL https://doi.org/10.1007/978-3-642-35289-8_26
-
[3]
Curiously fast convergence of some stochastic gradient descent algorithms
L \'e on Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, Paris, volume 8, pages 2624--2633. Citeseer, 2009
2009
-
[4]
Stochastic Gradient Descent Tricks, pages 421--436
L \'e on Bottou. Stochastic Gradient Descent Tricks, pages 421--436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_25. URL https://doi.org/10.1007/978-3-642-35289-8_25
-
[5]
L\' e on Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60 0 (2): 0 223--311, 2018. doi:10.1137/16M1080173. URL https://doi.org/10.1137/16M1080173
-
[6]
Last iterate convergence of incremental methods as a model of forgetting
Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods as a model of forgetting. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 102613--102647, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/fea9f93f4cec99f65a8b4d575fc353a8-Pa...
2025
-
[7]
Tighter convergence bounds for shuffled sgd via primal-dual perspective
Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Tighter convergence bounds for shuffled sgd via primal-dual perspective. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 72475--72524. Curran Associates, Inc., 2024. doi:10.52202/079017-2310...
-
[8]
Tighter lower bounds for shuffling SGD : Random permutations and beyond
Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD : Random permutations and beyond. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research...
2023
-
[9]
Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023
-
[10]
Why random reshuffling beats stochastic gradient descent
Mert G \"u rb \"u zbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186: 0 49--84, 2021
2021
-
[11]
Random shuffling beats SGD after finite epochs
Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2624--2633. PMLR, 09--15 Jun 2019. URL https://proceedings.mlr.press/v97/haochen19a.html
2019
-
[12]
Decomposition into functions in the minimization problem
Vladimir Kibardin. Decomposition into functions in the minimization problem. Automation and Remote Control, 1979, 01 1979
1979
-
[13]
Benign underfitting of stochastic gradient descent
Tomer Koren, Roi Livni, Yishay Mansour, and Uri Sherman. Benign underfitting of stochastic gradient descent. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 19605--19617. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/202...
2022
-
[14]
First-order and stochastic optimization methods for machine learning
Guanghui Lan. First-order and stochastic optimization methods for machine learning. Springer, 2020
2020
-
[15]
On the last-iterate convergence of shuffling gradient methods
Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pa...
2024
-
[16]
Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization
Zijian Liu and Zhengyuan Zhou. Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceed...
2025
-
[17]
Random reshuffling: Simple analysis with vast improvements
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtarik. Random reshuffling: Simple analysis with vast improvements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17309--17320. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/pap...
2020
-
[18]
SGD without replacement: Sharper rates for general smooth convex functions
Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4703--4711. PMLR, 09--15 Jun 2019. UR...
2019
-
[19]
Angelia Nedic and Dimitri P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12 0 (1): 0 109--138, 2001. doi:10.1137/S1052623499362111. URL https://doi.org/10.1137/S1052623499362111
-
[20]
Lectures on convex optimization, volume 137
Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018
2018
-
[21]
Nguyen, Quoc Tran-Dinh, Dzung T
Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. A unified convergence analysis for shuffling-type gradient methods. Journal of Machine Learning Research, 22 0 (207): 0 1--44, 2021. URL http://jmlr.org/papers/v22/20-1238.html
2021
-
[22]
Boris T. Polyak. Introduction to optimization. New York, Optimization Software, 1987
1987
-
[23]
Closing the convergence gap of SGD without replacement
Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7964--7973. PMLR, 13--18 Jul 2020. URL https://proceedings.mlr.pres...
2020
-
[24]
Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022
Shashank Rajput, Kangwook Lee, and Dimitris Papailiopoulos. Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=YiBa9HKTyXE
2022
-
[25]
On Information and Sufficiency,
Herbert Robbins and Sutton Monro. A Stochastic Approximation Method . The Annals of Mathematical Statistics, 22 0 (3): 0 400 -- 407, 1951. doi:10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586
-
[26]
Itay Safran and Ohad Shamir. How good is sgd with random shuffling? In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3250--3284. PMLR, 09--12 Jul 2020. URL https://proceedings.mlr.press/v125/safran20a.html
2020
-
[27]
Random shuffling beats sgd only after many epochs on ill-conditioned problems
Itay Safran and Ohad Shamir. Random shuffling beats sgd only after many epochs on ill-conditioned problems. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15151--15161. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/p...
2021
-
[28]
Without-replacement sampling for stochastic gradient methods
Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf
2016
-
[29]
Optimal rates for random order online optimization
Uri Sherman, Tomer Koren, and Yishay Mansour. Optimal rates for random order online optimization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2097--2108. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/fi...
2097
-
[30]
Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H. Sayed. Stochastic learning under random reshuffling with constant step-sizes. IEEE Transactions on Signal Processing, 67 0 (2): 0 474--489, 2019. doi:10.1109/TSP.2018.2878551
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.