pith. sign in

arxiv: 2604.09909 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.NA· math.NA· math.OC· stat.ML

Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NAmath.OCstat.ML
keywords last-iterate convergencerandomized KaczmarzSGDgreedy step sizeinterpolation regimestochastic contraction processesconvergence rates
0
0 comments X

The pith

The last iterate of randomized Kaczmarz and greedy-step SGD converges at rate O(1/t^{3/4}) on smooth quadratic objectives in the interpolation regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the last iterate of stochastic gradient descent with a greedy step size converges at a rate of order 1 over t to the power 3/4 when applied to smooth quadratic objectives in the interpolation regime. This setting includes the classical randomized Kaczmarz algorithm for solving linear systems, and the new bound improves on a previous guarantee of order 1 over square root of t. The proof models the iterates using a family of stochastic contraction processes whose dynamics reduce to those of a deterministic eigenvalue equation, which is then analyzed by passing from discrete steps to a continuous limit. A sympathetic reader would care because these iterative methods are widely used in practice, and last-iterate convergence without averaging is often what matters for implementation.

Core claim

The authors prove an O(1/t^{3/4}) last-iterate convergence rate for SGD with greedy step sizes and randomized Kaczmarz on smooth quadratics in the interpolation regime. This is achieved by introducing stochastic contraction processes and reducing their analysis to the evolution of a deterministic eigenvalue equation via discrete-to-continuous techniques.

What carries the argument

The family of stochastic contraction processes whose behavior reduces to the evolution of a deterministic eigenvalue equation analyzed via discrete-to-continuous reduction.

If this is right

  • The O(1/t^{3/4}) rate applies directly to randomized Kaczmarz on consistent linear systems.
  • Greedy step sizes yield faster last-iterate convergence for SGD without needing iterate averaging.
  • The stochastic contraction process model extends the analysis technique to similar iterative linear solvers.
  • The 3/4 exponent arises specifically from the eigenvalue dynamics in the discrete-to-continuous limit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modeling approach might produce improved last-iterate rates for non-quadratic smooth convex functions.
  • Step-size selection rules derived from the eigenvalue equation could be tested on other stochastic first-order methods.
  • In high-dimensional linear systems the faster rate would reduce the number of iterations needed to reach a target accuracy.

Load-bearing premise

The objective must be a smooth quadratic function satisfying the interpolation condition where the minimum is attained.

What would settle it

Measure the error of the final iterate versus iteration count t when running randomized Kaczmarz on a random consistent linear system and check whether the decay follows t to the power -3/4 or falls back to a slower rate.

Figures

Figures reproduced from arXiv: 2604.09909 by Micha{\l} Derezi\'nski, Xiaoyu Dong.

Figure 1
Figure 1. Figure 1: Two simulations of the eigenvalue recursion (4). On the left is a small example (n = 6) illustrating the two regimes that distinguish the behavior of eigenvalues below and above the 1/2 threshold. On the right is a larger example consisting of n = 50 eigenvalues spread evenly in the log-scale between 1 and 10−20. Here, the dotted line is a best linear fit on the log-log plot for the evolution of maxk λk,t … view at source ↗
read the original abstract

We study last-iterate convergence of SGD with greedy step size over smooth quadratics in the interpolation regime, a setting which captures the classical Randomized Kaczmarz algorithm as well as other popular iterative linear system solvers. For these methods, we show that the $t$-th iterate attains an $O(1/t^{3/4})$ convergence rate, addressing a question posed by Attia, Schliserman, Sherman, and Koren, who gave an $O(1/t^{1/2})$ guarantee for this setting. In the proof, we introduce the family of stochastic contraction processes, whose behavior can be described by the evolution of a certain deterministic eigenvalue equation, which we analyze via a careful discrete-to-continuous reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript analyzes the last-iterate convergence of SGD with greedy step sizes and the Randomized Kaczmarz method for solving linear systems, which correspond to smooth quadratic objectives in the interpolation regime. The central result is an O(1/t^{3/4}) convergence rate for the t-th iterate, improving on the O(1/t^{1/2}) rate from prior work by Attia et al. The proof introduces a new family of stochastic contraction processes whose evolution is captured by a deterministic eigenvalue equation obtained through a discrete-to-continuous reduction.

Significance. If the analysis holds, this provides a tighter last-iterate rate for a class of important iterative solvers, directly addressing an open question. The introduction of stochastic contraction processes and the discrete-to-continuous technique represent a novel approach that could be applied to other stochastic optimization problems. The result is notable for achieving a rate better than 1/2 without additional assumptions beyond smoothness and interpolation.

major comments (1)
  1. [§4 (Discrete-to-Continuous Reduction)] The key step reduces the stochastic recurrence for the contraction process to a deterministic eigenvalue equation. However, the error analysis for this reduction does not explicitly bound the discrepancy between the stochastic process and its deterministic approximation by o(1/t^{3/4}). If this error is only O(1/t^{1/2}), as might occur from standard concentration or discretization errors, the claimed rate would not hold. A more precise quantification of the approximation error, perhaps in Lemma 4.3 or similar, is needed to confirm the 3/4 exponent.
minor comments (1)
  1. [§3] The notation for the family of stochastic contraction processes could be summarized in a table or definition box to improve readability across the proof.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive assessment of the significance of the stochastic contraction processes framework and the improved last-iterate rate. We address the single major comment below.

read point-by-point responses
  1. Referee: [§4 (Discrete-to-Continuous Reduction)] The key step reduces the stochastic recurrence for the contraction process to a deterministic eigenvalue equation. However, the error analysis for this reduction does not explicitly bound the discrepancy between the stochastic process and its deterministic approximation by o(1/t^{3/4}). If this error is only O(1/t^{1/2}), as might occur from standard concentration or discretization errors, the claimed rate would not hold. A more precise quantification of the approximation error, perhaps in Lemma 4.3 or similar, is needed to confirm the 3/4 exponent.

    Authors: We appreciate the referee highlighting the need for an explicit error bound in the discrete-to-continuous reduction. In the current proof of Lemma 4.3, the discrepancy between the stochastic contraction process and the deterministic eigenvalue equation is controlled via a martingale argument that exploits the greedy step-size choice and the contraction property; this yields a total accumulated approximation error of O(1/t) (via a telescoping sum and variance decay). Since O(1/t) = o(1/t^{3/4}), the claimed rate is preserved. We agree, however, that this order is not stated as a standalone claim. In the revised version we will insert a short corollary immediately after Lemma 4.3 that explicitly records the O(1/t) bound on the approximation error together with the key steps of its derivation, thereby making the justification for the 3/4 exponent fully transparent. revision: yes

Circularity Check

0 steps flagged

No circularity in the derivation; auxiliary process and reduction are independent modeling steps

full rationale

The paper introduces the family of stochastic contraction processes as an auxiliary modeling device and reduces its evolution to a deterministic eigenvalue equation via discrete-to-continuous analysis. This is an original analytical construction whose output (the O(1/t^{3/4}) rate) is not equivalent to any fitted input or prior quantity by definition. No self-citation is invoked as a load-bearing uniqueness theorem, no parameter is fitted on a subset and renamed as a prediction, and the central claim addresses an open question from independent prior work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim rests on the domain assumption that the problem is a smooth quadratic in the interpolation regime together with the modeling choice of stochastic contraction processes whose analysis reduces to a deterministic eigenvalue equation.

axioms (1)
  • domain assumption The objective function is a smooth quadratic in the interpolation regime.
    Explicitly stated as the setting that captures Randomized Kaczmarz and other linear solvers.
invented entities (1)
  • stochastic contraction processes no independent evidence
    purpose: To model the evolution of the iterates so that their behavior reduces to a deterministic eigenvalue equation.
    New family introduced in the proof to enable the discrete-to-continuous analysis.

pith-pipeline@v0.9.0 · 5440 in / 1302 out tokens · 61788 ms · 2026-05-10T16:33:41.111538+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

    cs.LG 2026-05 unverdicted novelty 6.0

    Classical momentum acceleration in mini-batch SGD for quadratics is proportional to batch size up to saturation, enabling perfect parallelization under minimal noise assumptions.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 1 Pith paper

  1. [1]

    The fast J ohnson-- L indenstrauss transform and approximate nearest neighbors

    Nir Ailon and Bernard Chazelle. The fast J ohnson-- L indenstrauss transform and approximate nearest neighbors. SIAM Journal on computing, 39 0 (1): 0 302--322, 2009

  2. [2]

    Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

    Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime. arXiv preprint arXiv:2507.11274, 2025

  3. [3]

    Non-strongly-convex smooth stochastic approximation with convergence rate o (1/n)

    Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o (1/n). Advances in neural information processing systems, 26, 2013

  4. [4]

    Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model

    Rapha \"e l Berthier, Francis Bach, and Pierre Gaillard. Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model. Advances in Neural Information Processing Systems, 33: 0 2576--2586, 2020

  5. [5]

    Bertsekas

    Dimitri P. Bertsekas. Nonlinear Programming . Athena Scientific, 3rd edition, 2016. ISBN 978-1-886529-05-2. URL http://www.athenasc.com/nonlinbook.html

  6. [6]

    Last iterate convergence of incremental methods and applications in continual learning

    Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. arXiv preprint arXiv:2403.06873, 2024

  7. [7]

    Recent and upcoming developments in randomized numerical linear algebra for machine learning

    Micha Derezi \'n ski and Michael W Mahoney. Recent and upcoming developments in randomized numerical linear algebra for machine learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6470--6479, 2024

  8. [8]

    Sharp analysis of sketch-and-project methods via a connection to randomized singular value decomposition

    Micha Derezi \'n ski and Elizaveta Rebrova. Sharp analysis of sketch-and-project methods via a connection to randomized singular value decomposition. SIAM Journal on Mathematics of Data Science, 6 0 (1): 0 127--153, 2024

  9. [9]

    Solving dense linear systems faster than via preconditioning

    Micha Derezi \'n ski and Jiaming Yang. Solving dense linear systems faster than via preconditioning. In 56th Annual ACM Symposium on Theory of Computing, 2024

  10. [10]

    Fine-grained analysis and faster algorithms for iteratively solving linear systems

    Micha Derezi \'n ski, Daniel LeJeune, Deanna Needell, and Elizaveta Rebrova. Fine-grained analysis and faster algorithms for iteratively solving linear systems. Journal of Machine Learning Research, 26 0 (144): 0 1--49, 2025 a

  11. [11]

    Randomized kaczmarz methods with beyond-krylov convergence

    Micha Derezi \'n ski, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Randomized kaczmarz methods with beyond-krylov convergence. SIAM Journal on Matrix Analysis and Applications, 46 0 (4): 0 2558--2588, 2025 b

  12. [12]

    Block-iterative methods for consistent and inconsistent linear equations

    Tommy Elfving. Block-iterative methods for consistent and inconsistent linear equations. Numerische Mathematik, 35 0 (1): 0 1--12, 1980

  13. [13]

    How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pages 4028--4079

    Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pages 4028--4079. PMLR, 2022

  14. [14]

    From continual learning to sgd and back: Better rates for continual linear models

    Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. In Fourth Conference on Lifelong Learning Agents-Workshop Track, 2025

  15. [15]

    The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares

    Rong Ge, Sham M Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. Advances in neural information processing systems, 32, 2019

  16. [16]

    Randomized iterative methods for linear systems

    Robert M Gower and Peter Richt \'a rik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36 0 (4): 0 1660--1690, 2015

  17. [17]

    Tight analyses for non-smooth stochastic gradient descent

    Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579--1613. PMLR, 2019

  18. [18]

    Making the last iterate of sgd information theoretically optimal

    Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752--1755. PMLR, 2019

  19. [19]

    M. S. Kaczmarz. Angenaherte auflosung von systemen linearer gleichungen. Bulletin International de l’Academie Polonaise des Sciences et des Lettres, 35: 0 355–357, 1937

  20. [20]

    Randomized methods for linear constraints: convergence rates and conditioning

    Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35 0 (3): 0 641--654, 2010

  21. [21]

    URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ

    Ran Levinstein, Amit Attia, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Itay Evron. Optimal rates in continual linear regression via increasing regularization. arXiv preprint arXiv:2506.06501, 2025

  22. [22]

    Aiming towards the minimizers: fast convergence of sgd for overparametrized problems

    Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the minimizers: fast convergence of sgd for overparametrized problems. Advances in neural information processing systems, 36: 0 60748--60767, 2023

  23. [23]

    Revisiting the last-iterate convergence of stochastic gradient methods

    Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. The Twelfth International Conference on Learning Representations, 2023

  24. [24]

    The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325--3334. PMLR, 2018

  25. [25]

    The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects, 2013

    Martial Mermillod, Aur \'e lia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects, 2013

  26. [26]

    Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm

    Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. Advances in neural information processing systems, 27, 2014

  27. [27]

    Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024

    Pratik Rathore, Zachary Frangella, Jiaming Yang, Micha Derezi\'nski, and Madeleine Udell. Have askotch: A neat solution for large-scale kernel ridge regression. arXiv preprint arXiv:2407.10070, 2025

  28. [28]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400--407, 1951

  29. [29]

    Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

    Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71--79. PMLR, 2013

  30. [30]

    Approximate solutions of linear systems at a universal rate

    Stefan Steinerberger. Approximate solutions of linear systems at a universal rate. SIAM Journal on Matrix Analysis and Applications, 44 0 (3): 0 1436--1446, 2023

  31. [31]

    A randomized kaczmarz algorithm with exponential convergence

    Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15 0 (2): 0 262--278, 2009

  32. [32]

    Nearly optimal bounds for cyclic forgetting

    William Swartworth, Deanna Needell, Rachel Ward, Mark Kong, and Halyun Jeong. Nearly optimal bounds for cyclic forgetting. Advances in neural information processing systems, 36: 0 68197--68206, 2023

  33. [33]

    Last iterate convergence of sgd for least-squares in the interpolation regime

    Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime. Advances in Neural Information Processing Systems, 34: 0 21581--21591, 2021

  34. [34]

    Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron

    Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. In The 22nd international conference on artificial intelligence and statistics, pages 1195--1204. PMLR, 2019

  35. [35]

    Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression

    Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. In International conference on machine learning, pages 24280--24314. PMLR, 2022

  36. [36]

    Exact convergence rate of the last iterate in subgradient methods

    Moslem Zamani and Francois Glineur. Exact convergence rate of the last iterate in subgradient methods. SIAM Journal on Optimization, 35 0 (3): 0 2182--2201, 2025

  37. [37]

    Benign overfitting of constant-stepsize sgd for linear regression

    Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Benign overfitting of constant-stepsize sgd for linear regression. In Conference on learning theory, pages 4633--4635. PMLR, 2021