pith. sign in

arxiv: 2606.32005 · v1 · pith:OCTS5W6Tnew · submitted 2026-06-30 · 🧮 math.OC · cs.LG· stat.ML

Random Reshuffling Dominates Stochastic Gradient Descent

Pith reviewed 2026-07-01 03:47 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords random reshufflingstochastic gradient descentsmooth convex optimizationfinite-sum problemsconvergence ratesshuffling SGD
0
0 comments X

The pith

Random reshuffling dominates stochastic gradient descent in smooth convex finite-sum optimization under any reasonable stepsize after any finite number of epochs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that random reshuffling (RR), which randomly permutes the data once per epoch before taking gradients, achieves strictly better convergence guarantees than standard stochastic gradient descent (SGD) for smooth convex problems written as a finite sum. Earlier analyses required the stepsize to be at most order 1/n and the number of epochs to exceed order n before RR could be shown superior; the new result removes both restrictions. A sympathetic reader would therefore expect RR to be the default choice in practice whenever the objective meets the smoothness and convexity conditions, with no theoretical penalty for using larger stepsizes or running only a few passes over the data.

Core claim

For the first time, random reshuffling dominates SGD in smooth convex optimization under any reasonable stepsize after any finite number of epochs.

What carries the argument

The new comparison argument that directly bounds the expected suboptimality of RR against that of SGD for arbitrary reasonable stepsizes, without invoking the prior 1/n threshold.

If this is right

  • RR's optimally tuned rate is no longer strictly worse than SGD's when the number of epochs is smaller than order n.
  • Theoretical support now exists for the stepsize regimes and short training runs that are common in practice.
  • Shuffling SGD under RR can be used without the previous theoretical caveats that limited its applicability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dominance relation may hold for other permutation-based variants of SGD, though the paper does not prove it.
  • If the result extends to non-convex problems, it would further narrow the gap between theory and the widespread use of shuffling in deep learning.
  • The proof technique might be adapted to compare RR against other variance-reduced methods under similar relaxed stepsize conditions.

Load-bearing premise

The objective must be a finite sum of smooth convex component functions, and the stepsize must satisfy the paper's definition of 'reasonable' that drops the old 1/n upper bound.

What would settle it

A concrete smooth convex finite-sum instance together with a stepsize larger than order 1/n where, after a small number of epochs, the expected error of RR exceeds that of SGD.

read the original abstract

Stochastic Gradient Descent ($\textsf{SGD}$) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of $\textsf{SGD}$ differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ($\textsf{Shuffling SGD}$). A particularly popular strategy in $\textsf{Shuffling SGD}$ is Random Reshuffling ($\textsf{RR}$), which has achieved great empirical success across numerous experiments. Despite its strong performance, $\textsf{RR}$ has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for $\textsf{RR}$, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of $\textsf{RR}$ remain to this day. More precisely, according to the current theory, $\textsf{Shuffling SGD}$ under $\textsf{RR}$ converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of $\textsf{Shuffling SGD}$ under $\textsf{RR}$ is strictly worse than that of $\textsf{SGD}$ when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that $\textsf{RR}$ dominates $\textsf{SGD}$ in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proves that random reshuffling (RR) dominates stochastic gradient descent (SGD) for smooth convex finite-sum optimization under any reasonable stepsize after any finite number of epochs. It removes the prior O(1/n) stepsize restriction and the requirement that the number of epochs exceed a threshold proportional to n for the theoretical rate of RR to match or exceed that of SGD.

Significance. If the main theorem holds, the result supplies the first proof that RR is strictly superior to SGD without artificial stepsize upper bounds, directly addressing a decade-old gap between theory and the observed practical performance of shuffling-based methods. The removal of the 1/n restriction and the finite-epoch guarantee constitute a substantive advance for the analysis of variance-reduced and shuffling SGD variants.

minor comments (3)
  1. §2, Definition 2.1: the phrase 'reasonable stepsize' is introduced without an explicit inequality; a displayed condition (e.g., η ≤ c / L for some absolute c) would improve readability.
  2. Theorem 3.1: the statement claims dominance 'after any finite number of epochs'; the proof sketch in §4 should explicitly verify the base case k=1 to avoid any ambiguity about the induction start.
  3. Figure 1: the y-axis label 'excess risk' is not defined in the caption; add a parenthetical reference to the quantity appearing in Eq. (8).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive report and the recommendation to accept. The summary accurately captures the contributions of the work.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript presents a theoretical proof that RR dominates SGD for smooth convex finite-sum problems under a redefined 'reasonable stepsize' after any finite number of epochs. The abstract and available description frame this as resolving prior limitations (stepsize bound proportional to 1/n) via new analysis, without any quoted equations, fitted parameters, or self-citations that reduce the main claim to its inputs by construction. No self-definitional, fitted-input, or load-bearing self-citation patterns are exhibited in the provided material, so the derivation chain is treated as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, invented entities, or non-standard axioms beyond the standard setting of smooth convex finite-sum optimization.

pith-pipeline@v0.9.1-grok · 5847 in / 965 out tokens · 46845 ms · 2026-07-01T03:47:22.404739+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 8 canonical work pages

  1. [1]

    Sgd with shuffling: optimal rates without component convexity and large epoch requirements

    Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17526--17535. Curran Associates, Inc., 2020. URL https://proceedings.neurips.c...

  2. [2]

    Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478

    Yoshua Bengio. Practical Recommendations for Gradient-Based Training of Deep Architectures, pages 437--478. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_26. URL https://doi.org/10.1007/978-3-642-35289-8_26

  3. [3]

    Curiously fast convergence of some stochastic gradient descent algorithms

    L \'e on Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, Paris, volume 8, pages 2624--2633. Citeseer, 2009

  4. [4]

    Stochastic Gradient Descent Tricks, pages 421--436

    L \'e on Bottou. Stochastic Gradient Descent Tricks, pages 421--436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi:10.1007/978-3-642-35289-8_25. URL https://doi.org/10.1007/978-3-642-35289-8_25

  5. [5]

    Curtis, and Jorge Nocedal

    L\' e on Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60 0 (2): 0 223--311, 2018. doi:10.1137/16M1080173. URL https://doi.org/10.1137/16M1080173

  6. [6]

    Last iterate convergence of incremental methods as a model of forgetting

    Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods as a model of forgetting. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 102613--102647, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/fea9f93f4cec99f65a8b4d575fc353a8-Pa...

  7. [7]

    Tighter convergence bounds for shuffled sgd via primal-dual perspective

    Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Tighter convergence bounds for shuffled sgd via primal-dual perspective. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 72475--72524. Curran Associates, Inc., 2024. doi:10.52202/079017-2310...

  8. [8]

    Tighter lower bounds for shuffling SGD : Random permutations and beyond

    Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD : Random permutations and beyond. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research...

  9. [9]

    Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

    Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023

  10. [10]

    Why random reshuffling beats stochastic gradient descent

    Mert G \"u rb \"u zbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186: 0 49--84, 2021

  11. [11]

    Random shuffling beats SGD after finite epochs

    Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2624--2633. PMLR, 09--15 Jun 2019. URL https://proceedings.mlr.press/v97/haochen19a.html

  12. [12]

    Decomposition into functions in the minimization problem

    Vladimir Kibardin. Decomposition into functions in the minimization problem. Automation and Remote Control, 1979, 01 1979

  13. [13]

    Benign underfitting of stochastic gradient descent

    Tomer Koren, Roi Livni, Yishay Mansour, and Uri Sherman. Benign underfitting of stochastic gradient descent. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 19605--19617. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/202...

  14. [14]

    First-order and stochastic optimization methods for machine learning

    Guanghui Lan. First-order and stochastic optimization methods for machine learning. Springer, 2020

  15. [15]

    On the last-iterate convergence of shuffling gradient methods

    Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pa...

  16. [16]

    Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization

    Zijian Liu and Zhengyuan Zhou. Improved last-iterate convergence of shuffling gradient methods for nonsmooth convex optimization. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceed...

  17. [17]

    Random reshuffling: Simple analysis with vast improvements

    Konstantin Mishchenko, Ahmed Khaled, and Peter Richtarik. Random reshuffling: Simple analysis with vast improvements. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17309--17320. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/pap...

  18. [18]

    SGD without replacement: Sharper rates for general smooth convex functions

    Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4703--4711. PMLR, 09--15 Jun 2019. UR...

  19. [19]

    Bertsekas

    Angelia Nedic and Dimitri P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12 0 (1): 0 109--138, 2001. doi:10.1137/S1052623499362111. URL https://doi.org/10.1137/S1052623499362111

  20. [20]

    Lectures on convex optimization, volume 137

    Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018

  21. [21]

    Nguyen, Quoc Tran-Dinh, Dzung T

    Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. A unified convergence analysis for shuffling-type gradient methods. Journal of Machine Learning Research, 22 0 (207): 0 1--44, 2021. URL http://jmlr.org/papers/v22/20-1238.html

  22. [22]

    Boris T. Polyak. Introduction to optimization. New York, Optimization Software, 1987

  23. [23]

    Closing the convergence gap of SGD without replacement

    Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7964--7973. PMLR, 13--18 Jul 2020. URL https://proceedings.mlr.pres...

  24. [24]

    Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022

    Shashank Rajput, Kangwook Lee, and Dimitris Papailiopoulos. Permutation-based SGD : Is random optimal? In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=YiBa9HKTyXE

  25. [25]

    On Information and Sufficiency,

    Herbert Robbins and Sutton Monro. A Stochastic Approximation Method . The Annals of Mathematical Statistics, 22 0 (3): 0 400 -- 407, 1951. doi:10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586

  26. [26]

    Itay Safran and Ohad Shamir. How good is sgd with random shuffling? In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3250--3284. PMLR, 09--12 Jul 2020. URL https://proceedings.mlr.press/v125/safran20a.html

  27. [27]

    Random shuffling beats sgd only after many epochs on ill-conditioned problems

    Itay Safran and Ohad Shamir. Random shuffling beats sgd only after many epochs on ill-conditioned problems. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15151--15161. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/p...

  28. [28]

    Without-replacement sampling for stochastic gradient methods

    Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf

  29. [29]

    Optimal rates for random order online optimization

    Uri Sherman, Tomer Koren, and Yishay Mansour. Optimal rates for random order online optimization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2097--2108. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/fi...

  30. [30]

    Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H. Sayed. Stochastic learning under random reshuffling with constant step-sizes. IEEE Transactions on Signal Processing, 67 0 (2): 0 474--489, 2019. doi:10.1109/TSP.2018.2878551