Gradient Descent's Last Iterate is Often (slightly) Suboptimal

Guy Kornowski; Ohad Shamir

arxiv: 2604.13870 · v1 · submitted 2026-04-15 · 🧮 math.OC · cs.LG

Gradient Descent's Last Iterate is Often (slightly) Suboptimal

Guy Kornowski , Ohad Shamir This is my paper

Pith reviewed 2026-05-10 12:22 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords gradient descentlast iterateconvex optimizationconvergence ratesstepsize sequencesanytime guaranteesstochastic gradient descentpoly-log factors

0 comments

The pith

No stepsize sequence without knowledge of the horizon can guarantee optimal last-iterate rates for gradient descent on convex functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that achieving the best possible last-iterate convergence rate of 1 over square root of T for gradient descent requires knowing the total number of steps T ahead of time. Without this knowledge, any fixed stepsize sequence will suffer at least a poly-logarithmic factor worse error when stopped at an arbitrary time. The authors prove this for the deterministic case and thereby resolve a prior conjecture for the stochastic setting. This matters because most practical implementations use stepsize rules that do not depend on a predetermined T and must perform well no matter when optimization is halted.

Core claim

We prove that it is impossible to avoid an excess poly-log factor in T for the last iterate of gradient descent when the stepsize sequence must work for every possible stopping time without prior knowledge of the total horizon. This holds even for noiseless convex Lipschitz functions, and the proof indicates that such slightly suboptimal stopping times are unavoidably common under standard schedules.

What carries the argument

An anytime last-iterate guarantee, which requires the convergence bound to hold uniformly for all possible values of the stopping time T on any convex L-Lipschitz function.

If this is right

Common stepsize choices such as 1/sqrt(t) yield a last-iterate rate of log T over sqrt(T) rather than the optimal 1/sqrt(T).
Tailored stepsize sequences that know T in advance can recover the optimal rate.
The same limitation applies to stochastic gradient descent, confirming the earlier conjecture.
Suboptimal performance at the final iterate occurs for a wide range of stopping times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithm designers may benefit from developing adaptive stepsize rules that estimate the horizon or switch based on observed progress.
The result points to a fundamental limitation in online optimization settings where the total duration is not fixed in advance.
Similar poly-log overheads might appear in other anytime algorithms for convex problems.

Load-bearing premise

The stepsize sequence must be determined without knowledge of the total number of iterations T and must ensure the error bound holds for every possible stopping time on any convex Lipschitz function.

What would settle it

An explicit construction of a stepsize sequence achieving the optimal rate of 1/sqrt(T) for the last iterate at every stopping time T, without depending on T, for all convex L-Lipschitz functions would contradict the main theorem.

read the original abstract

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This settles the Jain conjecture with a clean adversarial lower bound that rules out optimal anytime last-iterate rates for both SGD and GD.

read the letter

The one or two things to know: this paper proves the 2019 conjecture of Jain et al. that no stepsize sequence for SGD can guarantee optimal last-iterate convergence without knowing the total number of steps T ahead of time. It also shows that deterministic gradient descent faces the same limitation. The authors establish this via an adversarial construction. For any predetermined sequence of stepsizes, they construct a convex L-Lipschitz function and a stopping time where the last iterate excess is at least on the order of (log τ) to some positive power over sqrt(τ). The argument relies only on standard convex analysis and is self-contained. This is a solid contribution because it makes precise why the usual log T / sqrt(T) rate appears and why you cannot avoid the extra factor in an anytime setting. The construction supports the claim that such slightly suboptimal stopping times are common rather than exceptional. The proof looks careful and does not introduce hidden assumptions or circular reasoning. It directly strengthens the original conjecture by covering the noiseless GD case as well. One soft spot is that the exact power of the logarithm is not optimized in the bound, but this does not affect the main impossibility result. The paper focuses on existence of some poly-log excess, which is sufficient. This work is for optimization researchers interested in last-iterate rates and anytime algorithms. It provides a clear negative result that helps set expectations for what GD and SGD can achieve. I would recommend sending it for peer review. The result is important enough and the argument is grounded enough to deserve a serious referee.

Referee Report

1 major / 2 minor

Summary. The manuscript proves that for minimizing convex L-Lipschitz functions, no stepsize sequence chosen without knowledge of the horizon T can guarantee an anytime last-iterate excess of o((log T)^c / sqrt(T)) for GD or SGD; this resolves the Jain et al. (2019) conjecture for SGD and strengthens it to the deterministic GD case via an adversarial construction showing that for any fixed {η_t} there exists a function and stopping time τ with excess Ω((log τ)^c / sqrt(τ)).

Significance. If the proof holds, the result is significant for optimization theory: it demonstrates that the poly-log factor in last-iterate rates is unavoidable for any horizon-independent schedule, even without noise, and that suboptimal stopping times are common. The self-contained argument using only standard convex-analysis tools, the extension to deterministic GD, and the explicit adversarial lower-bound construction are notable strengths that directly address a prior open conjecture.

major comments (1)

The lower-bound construction (detailed after the statement of the main theorem) relies on a specific piecewise-linear convex function; while the argument shows existence for any {η_t}, it would strengthen the claim to explicitly verify that the same construction works uniformly over all L-Lipschitz convex functions without additional restrictions on the domain or smoothness.

minor comments (2)

In the abstract and introduction, the phrase 'excess poly-log factor in T' should be accompanied by the precise exponent c that the construction yields, to avoid ambiguity with the O(log T / sqrt(T)) rate already known for standard schedules.
Notation for the stopping time τ and the sequence {η_t} is introduced without a dedicated preliminary section; a short table or paragraph collecting all symbols would improve readability for readers unfamiliar with the anytime setting.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We are grateful to the referee for recommending acceptance and for the constructive comment on the lower-bound construction. We address the point below.

read point-by-point responses

Referee: The lower-bound construction (detailed after the statement of the main theorem) relies on a specific piecewise-linear convex function; while the argument shows existence for any {η_t}, it would strengthen the claim to explicitly verify that the same construction works uniformly over all L-Lipschitz convex functions without additional restrictions on the domain or smoothness.

Authors: We thank the referee for this observation. Our lower bound is existential: for any fixed stepsize sequence {η_t}, there exists an L-Lipschitz convex function (the constructed piecewise-linear one) and stopping time τ such that the last-iterate excess is Ω((log τ)^c / sqrt(τ)). This is sufficient to establish that no horizon-independent schedule can guarantee the optimal rate uniformly over the entire function class. The adversarial function is convex and L-Lipschitz, defined on the real line (no bounded-domain restriction), and non-smooth, matching the theorem assumptions exactly with no additional restrictions imposed. To make this verification explicit as suggested, we will add a clarifying remark in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity; impossibility proof is self-contained

full rationale

The manuscript proves an impossibility result: for any fixed stepsize sequence, there exists a convex L-Lipschitz function and stopping time such that last-iterate excess is Ω((log τ)^c / √τ). This is established via an explicit adversarial construction using only standard convex-analysis tools (e.g., properties of subgradients and Lipschitz continuity). The argument does not invoke fitted parameters, self-definitional quantities, or load-bearing self-citations; the Jain et al. conjecture is merely the statement being proved, not an input that the proof reduces to. All steps are externally verifiable from first principles without reference to the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The result rests on the standard domain assumptions of convexity and Lipschitz continuity together with the definition of last-iterate convergence; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption The objective function is convex and L-Lipschitz continuous.
Standard assumption invoked for GD/SGD convergence analysis in the abstract.

pith-pipeline@v0.9.0 · 5492 in / 1186 out tokens · 29819 ms · 2026-05-10T12:22:59.077780+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1]

Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization

Jason M Altschuler and Pablo A Parrilo. Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization. Mathematical Programming, pages 1--14, 2024

work page 2024
[2]

Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime. arXiv preprint arXiv:2507.11274, 2025

work page arXiv 2025
[3]

Large-scale machine learning with stochastic gradient descent

L \'e on Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, pages 177--186. Springer, 2010

work page 2010
[4]

Last iterate convergence of incremental methods and applications in continual learning

Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. In International Conference on Learning Representations, 2025

work page 2025
[5]

From continual learning to sgd and back: Better rates for continual linear models

Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. In Fourth Conference on Lifelong Learning Agents-Workshop Track, 2025

work page 2025
[6]

Garrigos, D

Guillaume Garrigos, Daniel Cortild, Lucas Ketels, and Juan Peypouquet. Last-iterate complexity of sgd for convex and smooth stochastic problems. arXiv preprint arXiv:2507.14122, 2025

work page arXiv 2025
[7]

Deep learning, volume 1

Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016

work page 2016
[8]

Sgd: General analysis and improved rates

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt \'a rik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200--5209. PMLR, 2019

work page 2019
[9]

Accelerated objective gap and gradient norm convergence for gradient descent via long steps

Benjamin Grimmer, Kevin Shu, and Alex L Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. INFORMS Journal on Optimization, 7 0 (2): 0 156--169, 2025

work page 2025
[10]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579--1613. PMLR, 2019

work page 2019
[11]

Introduction to online convex optimization

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016

work page 2016
[12]

Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15 0 (1): 0 2489--2512, 2014

work page 2014
[13]

Making the last iterate of sgd information theoretically optimal

Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752--1755. PMLR, 2019

work page 2019
[14]

Open problem: Anytime convergence rate of gradient descent

Guy Kornowski and Ohad Shamir. Open problem: Anytime convergence rate of gradient descent. In The Thirty Seventh Annual Conference on Learning Theory, pages 5335--5339. PMLR, 2024

work page 2024
[15]

A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method

Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012

work page Pith review arXiv 2002
[16]

On the last-iterate convergence of shuffling gradient methods

Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. arXiv preprint arXiv:2403.07723, 2024 a

work page arXiv 2024
[17]

Revisiting the last-iterate convergence of stochastic gradient methods

Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations, 2024 b

work page 2024
[18]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011

work page 2011
[19]

Robust stochastic approximation approach to stochastic programming

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19 0 (4): 0 1574--1609, 2009

work page 2009
[20]

Problem complexity and method efficiency in optimization

Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983

work page 1983
[21]

The asymptotic density of sequences

Ivan Niven. The asymptotic density of sequences. Bulletin of the American Mathematical Society, 57 0 (6): 0 420--434, 1951

work page 1951
[22]

Acceleration of stochastic approximation by averaging

Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30 0 (4): 0 838--855, 1992

work page 1992
[23]

Making gradient descent optimal for strongly convex stochastic optimization

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 1571--1578, 2012

work page 2012
[24]

A stochastic approximation method

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400--407, 1951

work page 1951
[25]

Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47--1

Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47--1. JMLR Workshop and Conference Proceedings, 2012

work page 2012
[26]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71--79. PMLR, 2013

work page 2013
[27]

Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions

Adrien Taylor and Francis Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In Conference on Learning Theory, pages 2934--2992. PMLR, 2019

work page 2019
[28]

Last iterate convergence of sgd for least-squares in the interpolation regime

Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime. Advances in Neural Information Processing Systems, 34: 0 21581--21591, 2021

work page 2021
[29]

Exact convergence rate of the last iterate in subgradient methods

Moslem Zamani and Fran c ois Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023

work page arXiv 2023
[30]

Solving large scale linear prediction problems using stochastic gradient descent algorithms

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116, 2004

work page 2004
[31]

Anytime acceleration of gradient descent

Zihan Zhang, Jason Lee, Simon Du, and Yuxin Chen. Anytime acceleration of gradient descent. In Proceedings of Thirty Eighth Conference on Learning Theory, pages 5991--6013. PMLR, 2025

work page 2025

[1] [1]

Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization

Jason M Altschuler and Pablo A Parrilo. Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization. Mathematical Programming, pages 1--14, 2024

work page 2024

[2] [2]

Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime. arXiv preprint arXiv:2507.11274, 2025

work page arXiv 2025

[3] [3]

Large-scale machine learning with stochastic gradient descent

L \'e on Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, pages 177--186. Springer, 2010

work page 2010

[4] [4]

Last iterate convergence of incremental methods and applications in continual learning

Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. In International Conference on Learning Representations, 2025

work page 2025

[5] [5]

From continual learning to sgd and back: Better rates for continual linear models

Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. In Fourth Conference on Lifelong Learning Agents-Workshop Track, 2025

work page 2025

[6] [6]

Garrigos, D

Guillaume Garrigos, Daniel Cortild, Lucas Ketels, and Juan Peypouquet. Last-iterate complexity of sgd for convex and smooth stochastic problems. arXiv preprint arXiv:2507.14122, 2025

work page arXiv 2025

[7] [7]

Deep learning, volume 1

Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016

work page 2016

[8] [8]

Sgd: General analysis and improved rates

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt \'a rik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200--5209. PMLR, 2019

work page 2019

[9] [9]

Accelerated objective gap and gradient norm convergence for gradient descent via long steps

Benjamin Grimmer, Kevin Shu, and Alex L Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. INFORMS Journal on Optimization, 7 0 (2): 0 156--169, 2025

work page 2025

[10] [10]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579--1613. PMLR, 2019

work page 2019

[11] [11]

Introduction to online convex optimization

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016

work page 2016

[12] [12]

Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15 0 (1): 0 2489--2512, 2014

work page 2014

[13] [13]

Making the last iterate of sgd information theoretically optimal

Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752--1755. PMLR, 2019

work page 2019

[14] [14]

Open problem: Anytime convergence rate of gradient descent

Guy Kornowski and Ohad Shamir. Open problem: Anytime convergence rate of gradient descent. In The Thirty Seventh Annual Conference on Learning Theory, pages 5335--5339. PMLR, 2024

work page 2024

[15] [15]

A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method

Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012

work page Pith review arXiv 2002

[16] [16]

On the last-iterate convergence of shuffling gradient methods

Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. arXiv preprint arXiv:2403.07723, 2024 a

work page arXiv 2024

[17] [17]

Revisiting the last-iterate convergence of stochastic gradient methods

Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations, 2024 b

work page 2024

[18] [18]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011

work page 2011

[19] [19]

Robust stochastic approximation approach to stochastic programming

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19 0 (4): 0 1574--1609, 2009

work page 2009

[20] [20]

Problem complexity and method efficiency in optimization

Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983

work page 1983

[21] [21]

The asymptotic density of sequences

Ivan Niven. The asymptotic density of sequences. Bulletin of the American Mathematical Society, 57 0 (6): 0 420--434, 1951

work page 1951

[22] [22]

Acceleration of stochastic approximation by averaging

Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30 0 (4): 0 838--855, 1992

work page 1992

[23] [23]

Making gradient descent optimal for strongly convex stochastic optimization

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 1571--1578, 2012

work page 2012

[24] [24]

A stochastic approximation method

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400--407, 1951

work page 1951

[25] [25]

Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47--1

Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47--1. JMLR Workshop and Conference Proceedings, 2012

work page 2012

[26] [26]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71--79. PMLR, 2013

work page 2013

[27] [27]

Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions

Adrien Taylor and Francis Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In Conference on Learning Theory, pages 2934--2992. PMLR, 2019

work page 2019

[28] [28]

Last iterate convergence of sgd for least-squares in the interpolation regime

Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime. Advances in Neural Information Processing Systems, 34: 0 21581--21591, 2021

work page 2021

[29] [29]

Exact convergence rate of the last iterate in subgradient methods

Moslem Zamani and Fran c ois Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023

work page arXiv 2023

[30] [30]

Solving large scale linear prediction problems using stochastic gradient descent algorithms

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116, 2004

work page 2004

[31] [31]

Anytime acceleration of gradient descent

Zihan Zhang, Jason Lee, Simon Du, and Yuxin Chen. Anytime acceleration of gradient descent. In Proceedings of Thirty Eighth Conference on Learning Theory, pages 5991--6013. PMLR, 2025

work page 2025