pith. sign in

arxiv: 2604.13870 · v1 · submitted 2026-04-15 · 🧮 math.OC · cs.LG

Gradient Descent's Last Iterate is Often (slightly) Suboptimal

Pith reviewed 2026-05-10 12:22 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords gradient descentlast iterateconvex optimizationconvergence ratesstepsize sequencesanytime guaranteesstochastic gradient descentpoly-log factors
0
0 comments X

The pith

No stepsize sequence without knowledge of the horizon can guarantee optimal last-iterate rates for gradient descent on convex functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that achieving the best possible last-iterate convergence rate of 1 over square root of T for gradient descent requires knowing the total number of steps T ahead of time. Without this knowledge, any fixed stepsize sequence will suffer at least a poly-logarithmic factor worse error when stopped at an arbitrary time. The authors prove this for the deterministic case and thereby resolve a prior conjecture for the stochastic setting. This matters because most practical implementations use stepsize rules that do not depend on a predetermined T and must perform well no matter when optimization is halted.

Core claim

We prove that it is impossible to avoid an excess poly-log factor in T for the last iterate of gradient descent when the stepsize sequence must work for every possible stopping time without prior knowledge of the total horizon. This holds even for noiseless convex Lipschitz functions, and the proof indicates that such slightly suboptimal stopping times are unavoidably common under standard schedules.

What carries the argument

An anytime last-iterate guarantee, which requires the convergence bound to hold uniformly for all possible values of the stopping time T on any convex L-Lipschitz function.

If this is right

  • Common stepsize choices such as 1/sqrt(t) yield a last-iterate rate of log T over sqrt(T) rather than the optimal 1/sqrt(T).
  • Tailored stepsize sequences that know T in advance can recover the optimal rate.
  • The same limitation applies to stochastic gradient descent, confirming the earlier conjecture.
  • Suboptimal performance at the final iterate occurs for a wide range of stopping times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Algorithm designers may benefit from developing adaptive stepsize rules that estimate the horizon or switch based on observed progress.
  • The result points to a fundamental limitation in online optimization settings where the total duration is not fixed in advance.
  • Similar poly-log overheads might appear in other anytime algorithms for convex problems.

Load-bearing premise

The stepsize sequence must be determined without knowledge of the total number of iterations T and must ensure the error bound holds for every possible stopping time on any convex Lipschitz function.

What would settle it

An explicit construction of a stepsize sequence achieving the optimal rate of 1/sqrt(T) for the last iterate at every stopping time T, without depending on T, for all convex L-Lipschitz functions would contradict the main theorem.

read the original abstract

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proves that for minimizing convex L-Lipschitz functions, no stepsize sequence chosen without knowledge of the horizon T can guarantee an anytime last-iterate excess of o((log T)^c / sqrt(T)) for GD or SGD; this resolves the Jain et al. (2019) conjecture for SGD and strengthens it to the deterministic GD case via an adversarial construction showing that for any fixed {η_t} there exists a function and stopping time τ with excess Ω((log τ)^c / sqrt(τ)).

Significance. If the proof holds, the result is significant for optimization theory: it demonstrates that the poly-log factor in last-iterate rates is unavoidable for any horizon-independent schedule, even without noise, and that suboptimal stopping times are common. The self-contained argument using only standard convex-analysis tools, the extension to deterministic GD, and the explicit adversarial lower-bound construction are notable strengths that directly address a prior open conjecture.

major comments (1)
  1. The lower-bound construction (detailed after the statement of the main theorem) relies on a specific piecewise-linear convex function; while the argument shows existence for any {η_t}, it would strengthen the claim to explicitly verify that the same construction works uniformly over all L-Lipschitz convex functions without additional restrictions on the domain or smoothness.
minor comments (2)
  1. In the abstract and introduction, the phrase 'excess poly-log factor in T' should be accompanied by the precise exponent c that the construction yields, to avoid ambiguity with the O(log T / sqrt(T)) rate already known for standard schedules.
  2. Notation for the stopping time τ and the sequence {η_t} is introduced without a dedicated preliminary section; a short table or paragraph collecting all symbols would improve readability for readers unfamiliar with the anytime setting.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We are grateful to the referee for recommending acceptance and for the constructive comment on the lower-bound construction. We address the point below.

read point-by-point responses
  1. Referee: The lower-bound construction (detailed after the statement of the main theorem) relies on a specific piecewise-linear convex function; while the argument shows existence for any {η_t}, it would strengthen the claim to explicitly verify that the same construction works uniformly over all L-Lipschitz convex functions without additional restrictions on the domain or smoothness.

    Authors: We thank the referee for this observation. Our lower bound is existential: for any fixed stepsize sequence {η_t}, there exists an L-Lipschitz convex function (the constructed piecewise-linear one) and stopping time τ such that the last-iterate excess is Ω((log τ)^c / sqrt(τ)). This is sufficient to establish that no horizon-independent schedule can guarantee the optimal rate uniformly over the entire function class. The adversarial function is convex and L-Lipschitz, defined on the real line (no bounded-domain restriction), and non-smooth, matching the theorem assumptions exactly with no additional restrictions imposed. To make this verification explicit as suggested, we will add a clarifying remark in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity; impossibility proof is self-contained

full rationale

The manuscript proves an impossibility result: for any fixed stepsize sequence, there exists a convex L-Lipschitz function and stopping time such that last-iterate excess is Ω((log τ)^c / √τ). This is established via an explicit adversarial construction using only standard convex-analysis tools (e.g., properties of subgradients and Lipschitz continuity). The argument does not invoke fitted parameters, self-definitional quantities, or load-bearing self-citations; the Jain et al. conjecture is merely the statement being proved, not an input that the proof reduces to. All steps are externally verifiable from first principles without reference to the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The result rests on the standard domain assumptions of convexity and Lipschitz continuity together with the definition of last-iterate convergence; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The objective function is convex and L-Lipschitz continuous.
    Standard assumption invoked for GD/SGD convergence analysis in the abstract.

pith-pipeline@v0.9.0 · 5492 in / 1186 out tokens · 29819 ms · 2026-05-10T12:22:59.077780+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization

    Jason M Altschuler and Pablo A Parrilo. Acceleration by stepsize hedging: Silver stepsize schedule for smooth convex optimization. Mathematical Programming, pages 1--14, 2024

  2. [2]

    Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

    Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime. arXiv preprint arXiv:2507.11274, 2025

  3. [3]

    Large-scale machine learning with stochastic gradient descent

    L \'e on Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, pages 177--186. Springer, 2010

  4. [4]

    Last iterate convergence of incremental methods and applications in continual learning

    Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. In International Conference on Learning Representations, 2025

  5. [5]

    From continual learning to sgd and back: Better rates for continual linear models

    Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. In Fourth Conference on Lifelong Learning Agents-Workshop Track, 2025

  6. [6]

    Garrigos, D

    Guillaume Garrigos, Daniel Cortild, Lucas Ketels, and Juan Peypouquet. Last-iterate complexity of sgd for convex and smooth stochastic problems. arXiv preprint arXiv:2507.14122, 2025

  7. [7]

    Deep learning, volume 1

    Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016

  8. [8]

    Sgd: General analysis and improved rates

    Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt \'a rik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200--5209. PMLR, 2019

  9. [9]

    Accelerated objective gap and gradient norm convergence for gradient descent via long steps

    Benjamin Grimmer, Kevin Shu, and Alex L Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. INFORMS Journal on Optimization, 7 0 (2): 0 156--169, 2025

  10. [10]

    Tight analyses for non-smooth stochastic gradient descent

    Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579--1613. PMLR, 2019

  11. [11]

    Introduction to online convex optimization

    Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016

  12. [12]

    Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization

    Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15 0 (1): 0 2489--2512, 2014

  13. [13]

    Making the last iterate of sgd information theoretically optimal

    Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752--1755. PMLR, 2019

  14. [14]

    Open problem: Anytime convergence rate of gradient descent

    Guy Kornowski and Ohad Shamir. Open problem: Anytime convergence rate of gradient descent. In The Thirty Seventh Annual Conference on Learning Theory, pages 5335--5339. PMLR, 2024

  15. [15]

    A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method

    Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012

  16. [16]

    On the last-iterate convergence of shuffling gradient methods

    Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. arXiv preprint arXiv:2403.07723, 2024 a

  17. [17]

    Revisiting the last-iterate convergence of stochastic gradient methods

    Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations, 2024 b

  18. [18]

    Non-asymptotic analysis of stochastic approximation algorithms for machine learning

    Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011

  19. [19]

    Robust stochastic approximation approach to stochastic programming

    Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19 0 (4): 0 1574--1609, 2009

  20. [20]

    Problem complexity and method efficiency in optimization

    Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983

  21. [21]

    The asymptotic density of sequences

    Ivan Niven. The asymptotic density of sequences. Bulletin of the American Mathematical Society, 57 0 (6): 0 420--434, 1951

  22. [22]

    Acceleration of stochastic approximation by averaging

    Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30 0 (4): 0 838--855, 1992

  23. [23]

    Making gradient descent optimal for strongly convex stochastic optimization

    Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 1571--1578, 2012

  24. [24]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400--407, 1951

  25. [25]

    Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47--1

    Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47--1. JMLR Workshop and Conference Proceedings, 2012

  26. [26]

    Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

    Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71--79. PMLR, 2013

  27. [27]

    Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions

    Adrien Taylor and Francis Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In Conference on Learning Theory, pages 2934--2992. PMLR, 2019

  28. [28]

    Last iterate convergence of sgd for least-squares in the interpolation regime

    Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime. Advances in Neural Information Processing Systems, 34: 0 21581--21591, 2021

  29. [29]

    Exact convergence rate of the last iterate in subgradient methods

    Moslem Zamani and Fran c ois Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023

  30. [30]

    Solving large scale linear prediction problems using stochastic gradient descent algorithms

    Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116, 2004

  31. [31]

    Anytime acceleration of gradient descent

    Zihan Zhang, Jason Lee, Simon Du, and Yuxin Chen. Anytime acceleration of gradient descent. In Proceedings of Thirty Eighth Conference on Learning Theory, pages 5991--6013. PMLR, 2025