Gradient Descent's Last Iterate is Often (slightly) Suboptimal
Pith reviewed 2026-05-10 12:22 UTC · model grok-4.3
The pith
No stepsize sequence without knowledge of the horizon can guarantee optimal last-iterate rates for gradient descent on convex functions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that it is impossible to avoid an excess poly-log factor in T for the last iterate of gradient descent when the stepsize sequence must work for every possible stopping time without prior knowledge of the total horizon. This holds even for noiseless convex Lipschitz functions, and the proof indicates that such slightly suboptimal stopping times are unavoidably common under standard schedules.
What carries the argument
An anytime last-iterate guarantee, which requires the convergence bound to hold uniformly for all possible values of the stopping time T on any convex L-Lipschitz function.
If this is right
- Common stepsize choices such as 1/sqrt(t) yield a last-iterate rate of log T over sqrt(T) rather than the optimal 1/sqrt(T).
- Tailored stepsize sequences that know T in advance can recover the optimal rate.
- The same limitation applies to stochastic gradient descent, confirming the earlier conjecture.
- Suboptimal performance at the final iterate occurs for a wide range of stopping times.
Where Pith is reading between the lines
- Algorithm designers may benefit from developing adaptive stepsize rules that estimate the horizon or switch based on observed progress.
- The result points to a fundamental limitation in online optimization settings where the total duration is not fixed in advance.
- Similar poly-log overheads might appear in other anytime algorithms for convex problems.
Load-bearing premise
The stepsize sequence must be determined without knowledge of the total number of iterations T and must ensure the error bound holds for every possible stopping time on any convex Lipschitz function.
What would settle it
An explicit construction of a stepsize sequence achieving the optimal rate of 1/sqrt(T) for the last iterate at every stopping time T, without depending on T, for all convex L-Lipschitz functions would contradict the main theorem.
read the original abstract
We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proves that for minimizing convex L-Lipschitz functions, no stepsize sequence chosen without knowledge of the horizon T can guarantee an anytime last-iterate excess of o((log T)^c / sqrt(T)) for GD or SGD; this resolves the Jain et al. (2019) conjecture for SGD and strengthens it to the deterministic GD case via an adversarial construction showing that for any fixed {η_t} there exists a function and stopping time τ with excess Ω((log τ)^c / sqrt(τ)).
Significance. If the proof holds, the result is significant for optimization theory: it demonstrates that the poly-log factor in last-iterate rates is unavoidable for any horizon-independent schedule, even without noise, and that suboptimal stopping times are common. The self-contained argument using only standard convex-analysis tools, the extension to deterministic GD, and the explicit adversarial lower-bound construction are notable strengths that directly address a prior open conjecture.
major comments (1)
- The lower-bound construction (detailed after the statement of the main theorem) relies on a specific piecewise-linear convex function; while the argument shows existence for any {η_t}, it would strengthen the claim to explicitly verify that the same construction works uniformly over all L-Lipschitz convex functions without additional restrictions on the domain or smoothness.
minor comments (2)
- In the abstract and introduction, the phrase 'excess poly-log factor in T' should be accompanied by the precise exponent c that the construction yields, to avoid ambiguity with the O(log T / sqrt(T)) rate already known for standard schedules.
- Notation for the stopping time τ and the sequence {η_t} is introduced without a dedicated preliminary section; a short table or paragraph collecting all symbols would improve readability for readers unfamiliar with the anytime setting.
Simulated Author's Rebuttal
We are grateful to the referee for recommending acceptance and for the constructive comment on the lower-bound construction. We address the point below.
read point-by-point responses
-
Referee: The lower-bound construction (detailed after the statement of the main theorem) relies on a specific piecewise-linear convex function; while the argument shows existence for any {η_t}, it would strengthen the claim to explicitly verify that the same construction works uniformly over all L-Lipschitz convex functions without additional restrictions on the domain or smoothness.
Authors: We thank the referee for this observation. Our lower bound is existential: for any fixed stepsize sequence {η_t}, there exists an L-Lipschitz convex function (the constructed piecewise-linear one) and stopping time τ such that the last-iterate excess is Ω((log τ)^c / sqrt(τ)). This is sufficient to establish that no horizon-independent schedule can guarantee the optimal rate uniformly over the entire function class. The adversarial function is convex and L-Lipschitz, defined on the real line (no bounded-domain restriction), and non-smooth, matching the theorem assumptions exactly with no additional restrictions imposed. To make this verification explicit as suggested, we will add a clarifying remark in the revised manuscript. revision: yes
Circularity Check
No circularity; impossibility proof is self-contained
full rationale
The manuscript proves an impossibility result: for any fixed stepsize sequence, there exists a convex L-Lipschitz function and stopping time such that last-iterate excess is Ω((log τ)^c / √τ). This is established via an explicit adversarial construction using only standard convex-analysis tools (e.g., properties of subgradients and Lipschitz continuity). The argument does not invoke fitted parameters, self-definitional quantities, or load-bearing self-citations; the Jain et al. conjecture is merely the statement being proved, not an input that the proof reduces to. All steps are externally verifiable from first principles without reference to the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The objective function is convex and L-Lipschitz continuous.
Reference graph
Works this paper leans on
-
[1]
Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization
Jason M Altschuler and Pablo A Parrilo. Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization. Mathematical Programming, pages 1--14, 2024
work page 2024
-
[2]
Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime. arXiv preprint arXiv:2507.11274, 2025
-
[3]
Large-scale machine learning with stochastic gradient descent
L \'e on Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, pages 177--186. Springer, 2010
work page 2010
-
[4]
Last iterate convergence of incremental methods and applications in continual learning
Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. In International Conference on Learning Representations, 2025
work page 2025
-
[5]
From continual learning to sgd and back: Better rates for continual linear models
Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. In Fourth Conference on Lifelong Learning Agents-Workshop Track, 2025
work page 2025
-
[6]
Guillaume Garrigos, Daniel Cortild, Lucas Ketels, and Juan Peypouquet. Last-iterate complexity of sgd for convex and smooth stochastic problems. arXiv preprint arXiv:2507.14122, 2025
-
[7]
Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016
work page 2016
-
[8]
Sgd: General analysis and improved rates
Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt \'a rik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200--5209. PMLR, 2019
work page 2019
-
[9]
Accelerated objective gap and gradient norm convergence for gradient descent via long steps
Benjamin Grimmer, Kevin Shu, and Alex L Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. INFORMS Journal on Optimization, 7 0 (2): 0 156--169, 2025
work page 2025
-
[10]
Tight analyses for non-smooth stochastic gradient descent
Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579--1613. PMLR, 2019
work page 2019
-
[11]
Introduction to online convex optimization
Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016
work page 2016
-
[12]
Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15 0 (1): 0 2489--2512, 2014
work page 2014
-
[13]
Making the last iterate of sgd information theoretically optimal
Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752--1755. PMLR, 2019
work page 2019
-
[14]
Open problem: Anytime convergence rate of gradient descent
Guy Kornowski and Ohad Shamir. Open problem: Anytime convergence rate of gradient descent. In The Thirty Seventh Annual Conference on Learning Theory, pages 5335--5339. PMLR, 2024
work page 2024
-
[15]
Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012
work page Pith review arXiv 2002
-
[16]
On the last-iterate convergence of shuffling gradient methods
Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. arXiv preprint arXiv:2403.07723, 2024 a
-
[17]
Revisiting the last-iterate convergence of stochastic gradient methods
Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations, 2024 b
work page 2024
-
[18]
Non-asymptotic analysis of stochastic approximation algorithms for machine learning
Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011
work page 2011
-
[19]
Robust stochastic approximation approach to stochastic programming
Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19 0 (4): 0 1574--1609, 2009
work page 2009
-
[20]
Problem complexity and method efficiency in optimization
Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983
work page 1983
-
[21]
The asymptotic density of sequences
Ivan Niven. The asymptotic density of sequences. Bulletin of the American Mathematical Society, 57 0 (6): 0 420--434, 1951
work page 1951
-
[22]
Acceleration of stochastic approximation by averaging
Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30 0 (4): 0 838--855, 1992
work page 1992
-
[23]
Making gradient descent optimal for strongly convex stochastic optimization
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 1571--1578, 2012
work page 2012
-
[24]
A stochastic approximation method
Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400--407, 1951
work page 1951
-
[25]
Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47--1. JMLR Workshop and Conference Proceedings, 2012
work page 2012
-
[26]
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71--79. PMLR, 2013
work page 2013
-
[27]
Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions
Adrien Taylor and Francis Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In Conference on Learning Theory, pages 2934--2992. PMLR, 2019
work page 2019
-
[28]
Last iterate convergence of sgd for least-squares in the interpolation regime
Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime. Advances in Neural Information Processing Systems, 34: 0 21581--21591, 2021
work page 2021
-
[29]
Exact convergence rate of the last iterate in subgradient methods
Moslem Zamani and Fran c ois Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023
-
[30]
Solving large scale linear prediction problems using stochastic gradient descent algorithms
Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116, 2004
work page 2004
-
[31]
Anytime acceleration of gradient descent
Zihan Zhang, Jason Lee, Simon Du, and Yuxin Chen. Anytime acceleration of gradient descent. In Proceedings of Thirty Eighth Conference on Learning Theory, pages 5991--6013. PMLR, 2025
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.