New Bounds for the Last Iterate of the Stochastic subGradient Method

Andrea Paudice; Guglielmo Beretta; Roberto Colomboni; Tommaso Cesari

arxiv: 2606.24879 · v1 · pith:5R5VCZSCnew · submitted 2026-06-23 · 🧮 math.OC · cs.LG

New Bounds for the Last Iterate of the Stochastic subGradient Method

Guglielmo Beretta , Tommaso Cesari , Roberto Colomboni , Andrea Paudice This is my paper

Pith reviewed 2026-06-25 22:41 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords stochastic subgradient methodlast iterate convergenceconvex optimizationstochastic optimizationone-dimensional optimizationoptimization error boundsi.i.d. noise

0 comments

The pith

Under i.i.d. additive subgradient noise with bounded variance, the last iterate of stochastic subgradient descent reaches order 1/√n error in one dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the final point generated by the stochastic subgradient method on one-dimensional convex Lipschitz problems. With the usual fixed stepsize schedule of order 1/√n, it shows that additive i.i.d. noise of uniformly bounded variance produces an optimization error of order 1/√n for the last iterate. This removes the extra logarithmic factor that appears in earlier generic analyses. The same setting without the i.i.d. requirement allows the error to reach order (log n)/√n, giving a negative answer to an open question on whether the last iterate can be optimal under only bounded-variance noise.

Core claim

For fixed horizon n and stepsize η = Θ(1/√n), when subgradient noise is additive, i.i.d., and has uniformly bounded variance, the optimization error of the last iterate is of order 1/√n for one-dimensional convex Lipschitz objectives. Dropping the i.i.d. assumption permits error of order (log n)/√n, showing that the last iterate is suboptimal under bounded variance alone.

What carries the argument

Last-iterate error analysis that exploits one-dimensional convexity together with the additive i.i.d. structure of the noise.

If this is right

The last iterate matches the known rate of the averaged iterate when the noise satisfies the i.i.d. condition.
Existing generic bounds that carry an extra log n factor are not tight under the i.i.d. assumption.
The log n factor reappears as soon as the noise process loses the i.i.d. property, even in dimension one.
The result separates the effect of noise correlation from the effect of dimension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analyses of other first-order stochastic methods may improve when they explicitly use i.i.d. structure rather than only variance bounds.
The distinction between i.i.d. and merely bounded-variance noise could matter for last-iterate guarantees in non-convex or higher-dimensional settings.
Practical implementations that average iterates might be replaceable by the raw last iterate if the noise is known to be i.i.d.

Load-bearing premise

The subgradient noise must be additive, independent across steps, identically distributed, and have uniformly bounded variance.

What would settle it

An explicit one-dimensional convex Lipschitz function and a sequence of i.i.d. bounded-variance noises for which the last iterate error exceeds order 1/√n would falsify the positive bound.

read the original abstract

We study the last iterate of the stochastic subgradient method for one-dimensional convex Lipschitz objectives. For a fixed horizon $n$, we consider the standard fixed stepsizes $\eta =\Theta(1/\sqrt n)$. We prove that, for such stepsize policies, under additive i.i.d. subgradient noise with uniformly bounded variance, the last iterate features an optimization error of order $1/\sqrt n$, thereby removing the extra $(\log n)$ factor present in existing generic bounds. On the other hand, we show that without the i.i.d. assumption, the optimization error can be of order $(\log n)/\sqrt n$. Thus, under the uniformly bounded variance assumption alone, the last iterate of SsGM is suboptimal even in dimension one, resolving negatively an open problem posed in Koren and Segal, COLT, 2020.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Paper cleanly separates i.i.d. vs non-i.i.d. noise to get O(1/sqrt n) last-iterate rate under the stronger assumption in 1D, with a matching-style lower bound otherwise.

read the letter

The main point is that this work gives an O(1/sqrt n) last-iterate bound for stochastic subgradient on 1D convex Lipschitz problems when the noise is additive, i.i.d., and has bounded variance, using the usual fixed stepsize of order 1/sqrt n. It also shows that dropping the i.i.d. condition allows a (log n)/sqrt n lower bound even in one dimension.

The positive result removes the extra log factor that shows up in generic last-iterate analyses. The negative result directly answers the Koren-Segal 2020 open question in the negative under uniform bounded variance alone. Both directions are stated with explicit noise assumptions and fixed stepsizes, which keeps the claims sharp.

The separation of cases is useful because i.i.d. noise is common in streaming or repeated sampling settings. The derivations appear to come straight from the standard recursion without circular definitions or post-hoc adjustments.

The main limitation is the restriction to dimension one; nothing is said about whether the same distinction holds or fails in higher dimensions. The lower-bound construction uses non-i.i.d. noise, and it would help to know how artificial that example is. Full proofs are not in the abstract, so any referee would need to verify the constants and the exact conditions under which the bounds hold.

This is for people working on convergence rates for stochastic first-order methods, especially those who care about last-iterate guarantees rather than ergodic averages. It is narrow but precise.

I would send it to peer review. The claims address a concrete open problem with matching-style bounds under different assumptions, and the work is grounded enough to deserve expert scrutiny on the derivations.

Referee Report

0 major / 2 minor

Summary. The paper studies the last iterate of the stochastic subgradient method for one-dimensional convex Lipschitz objectives. With fixed stepsizes η=Θ(1/√n), it proves that additive i.i.d. subgradient noise with uniformly bounded variance yields last-iterate optimization error of order 1/√n (removing the extra log n factor from generic bounds). Without the i.i.d. assumption the error can reach (log n)/√n, negatively resolving an open problem from Koren and Segal (COLT 2020).

Significance. If the derivations hold, the result is significant for clarifying the precise role of the i.i.d. assumption in last-iterate rates for stochastic subgradient methods, even in dimension one. It supplies both a positive rate under the stronger noise model and an explicit counter-example construction under uniform bounded variance alone, directly addressing a stated open question.

minor comments (2)

The abstract states the positive and negative results cleanly but does not indicate the theorem numbers or sections where the i.i.d. positive bound and the non-i.i.d. construction appear; adding these cross-references would aid readers.
Notation for the noise process (additive i.i.d. with bounded variance) is introduced in the abstract; a short dedicated paragraph in §2 or §3 defining the precise probability space and moment assumptions would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of our results, and recommendation of minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation proceeds from the standard subgradient recursion under the paper's explicit assumptions (additive i.i.d. noise with uniformly bounded variance for the positive O(1/√n) result, and the same variance bound without i.i.d. for the negative (log n)/√n counter-example). No step reduces the claimed rate to a quantity defined by the same paper, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain. The separation of results follows directly from the differing noise conditions applied to the same recursion, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Claims rest on standard domain assumptions of one-dimensional convex Lipschitz functions and additive noise; no free parameters or invented entities are introduced.

axioms (2)

domain assumption Objective is convex and Lipschitz continuous in one dimension
Invoked to guarantee existence of subgradients and to control the recursion.
domain assumption Subgradient noise is additive with uniformly bounded variance
Central modeling choice that enables the 1/sqrt n rate; contrasted with non-i.i.d. case.

pith-pipeline@v0.9.1-grok · 5683 in / 1348 out tokens · 23379 ms · 2026-06-25T22:41:18.059016+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 2 linked inside Pith

[1]

Bartlett, Pradeep Ravikumar, and Martin J

Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information- theoretic lower bounds on the oracle complexity of stochastic convex optimization.IEEE Transactions on Information Theory, 58(5):3235–3249, 2012

2012
[2]

MIT press, 2024

Francis Bach.Learning theory from first principles. MIT press, 2024

2024
[3]

General tail bounds for non-smooth stochastic mirror descent

Khaled Eldowa and Andrea Paudice. General tail bounds for non-smooth stochastic mirror descent. InProceedings of the 27-th International Conference on Artificial Intelligence and Statistics, pages 3205–3213, 2024

2024
[4]

Yu. M. Ermol’ev. On the method of generalized stochastic gradients and quasi-Féjer sequences. Cybernetics, 5:208–220, 1969

1969
[5]

Nicholas J. A. Harvey, Chris Liaw, and Sikander Randhawa. Tight analyses for subgradient descent I: lower bounds.Open J. Math. Optim., 5:1–17, 2024

2024
[6]

Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa

Nicholas J.A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InProceedinds of the 32nd International Conference on Computational Learning Theory, pages 1579–1613, 2019. 20

2019
[7]

Nagaraj, and Praneeth Netrapalli

Prateek Jain, Dheeraj M. Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal.SIAM Journal on Optimization, 31(2):1108–1130, 2021

2021
[8]

Springer, 3 edition, 2021

Olav Kallenberg.Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling. Springer, 3 edition, 2021

2021
[9]

Open problem: Tight convergence of SGD in constant dimension

Tomer Koren and Shahar Segal. Open problem: Tight convergence of SGD in constant dimension. InCOLT 2020, volume 125 ofProceedings of Machine Learning Research, pages 3847–3851, 2020

2020
[10]

Gradient descent’s last iterate is often (slightly) suboptimal

Guy Kornowski and Ohad Shamir. Gradient descent’s last iterate is often (slightly) suboptimal. Preprint at arXiv:2604.13870, April 2026

Pith/arXiv arXiv 2026
[11]

The convergence rate of SGD’s final iterate: Analysis on dimension dependence, 2021

Daogao Liu and Zhou Lu. The convergence rate of SGD’s final iterate: Analysis on dimension dependence, 2021. Preprint arXiv:2106.14588

arXiv 2021
[12]

Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation.arXiv preprint arXiv:2303.12277, 2023

Zijian Liu and Zhengyuan Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation.arXiv preprint arXiv:2303.12277, 2023

arXiv 2023
[13]

Revisiting the last-iterate convergence of stochastic gradient methods

Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. InProceedings of the 12-th International Conference on Learning Representations (to appear), 2024

2024
[14]

Meyn and Richard L

Sean P. Meyn and Richard L. Tweedie.Markov Chains and Stochastic Stability. Cambridge University Press, 2 edition, 2009

2009
[15]

Nemirovski, A

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4):1574–1609, 2009

2009
[16]

Wiley-Interscience, 1983

Arkadij Semenovič Nemirovskij and David Borisovich Yudin.Problem complexity and method efficiency in optimization. Wiley-Interscience, 1983

1983
[17]

An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise, 2025

Daniela Angela Parletta, Andrea Paudice, and Saverio Salzo. An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise, 2025

2025
[18]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InProceedings of the 30th International Conference on Machine Learning, 2013

2013
[19]

Serdar Yüksel and Sean P. Meyn. Random-time, state-dependent stochastic drift for Markov chains and application to stochastic stabilization over erasure channels, 2012. Preprint at arXiv:1010.4820

Pith/arXiv arXiv 2012
[20]

Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

Moslem Zamani and François Glineur. Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025. 21 A Proofs of the Deferred Claims A.1 Proof of Claim 7 Proof. We first prove thatP satisfies the Feller property. Recall that a transition probability kernel Pon the compact metric spaceIis Feller if φ...

2025
[21]

If α > a, then everyx∈ [α, α+ ρ) ∩ X lies strictly to the left ofS, and every relative subgradient g∈∂ X f(x)satisfiesg≤0
[22]

If β < b , then every x∈ (β−ρ, β ] ∩ X lies strictly to the right of S, and every relative subgradientg∈∂ X f(x)satisfiesg≥0. Proof. Since S is a non-empty closed convex subset of the intervalX, it is a closed interval inX. Write u= infS, v= supS. If α > a , then the left endpoint of the enlarged set is not created by the boundary ofX, hence α = u−ρ . The...

[1] [1]

Bartlett, Pradeep Ravikumar, and Martin J

Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information- theoretic lower bounds on the oracle complexity of stochastic convex optimization.IEEE Transactions on Information Theory, 58(5):3235–3249, 2012

2012

[2] [2]

MIT press, 2024

Francis Bach.Learning theory from first principles. MIT press, 2024

2024

[3] [3]

General tail bounds for non-smooth stochastic mirror descent

Khaled Eldowa and Andrea Paudice. General tail bounds for non-smooth stochastic mirror descent. InProceedings of the 27-th International Conference on Artificial Intelligence and Statistics, pages 3205–3213, 2024

2024

[4] [4]

Yu. M. Ermol’ev. On the method of generalized stochastic gradients and quasi-Féjer sequences. Cybernetics, 5:208–220, 1969

1969

[5] [5]

Nicholas J. A. Harvey, Chris Liaw, and Sikander Randhawa. Tight analyses for subgradient descent I: lower bounds.Open J. Math. Optim., 5:1–17, 2024

2024

[6] [6]

Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa

Nicholas J.A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InProceedinds of the 32nd International Conference on Computational Learning Theory, pages 1579–1613, 2019. 20

2019

[7] [7]

Nagaraj, and Praneeth Netrapalli

Prateek Jain, Dheeraj M. Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal.SIAM Journal on Optimization, 31(2):1108–1130, 2021

2021

[8] [8]

Springer, 3 edition, 2021

Olav Kallenberg.Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling. Springer, 3 edition, 2021

2021

[9] [9]

Open problem: Tight convergence of SGD in constant dimension

Tomer Koren and Shahar Segal. Open problem: Tight convergence of SGD in constant dimension. InCOLT 2020, volume 125 ofProceedings of Machine Learning Research, pages 3847–3851, 2020

2020

[10] [10]

Gradient descent’s last iterate is often (slightly) suboptimal

Guy Kornowski and Ohad Shamir. Gradient descent’s last iterate is often (slightly) suboptimal. Preprint at arXiv:2604.13870, April 2026

Pith/arXiv arXiv 2026

[11] [11]

The convergence rate of SGD’s final iterate: Analysis on dimension dependence, 2021

Daogao Liu and Zhou Lu. The convergence rate of SGD’s final iterate: Analysis on dimension dependence, 2021. Preprint arXiv:2106.14588

arXiv 2021

[12] [12]

Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation.arXiv preprint arXiv:2303.12277, 2023

Zijian Liu and Zhengyuan Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation.arXiv preprint arXiv:2303.12277, 2023

arXiv 2023

[13] [13]

Revisiting the last-iterate convergence of stochastic gradient methods

Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. InProceedings of the 12-th International Conference on Learning Representations (to appear), 2024

2024

[14] [14]

Meyn and Richard L

Sean P. Meyn and Richard L. Tweedie.Markov Chains and Stochastic Stability. Cambridge University Press, 2 edition, 2009

2009

[15] [15]

Nemirovski, A

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4):1574–1609, 2009

2009

[16] [16]

Wiley-Interscience, 1983

Arkadij Semenovič Nemirovskij and David Borisovich Yudin.Problem complexity and method efficiency in optimization. Wiley-Interscience, 1983

1983

[17] [17]

An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise, 2025

Daniela Angela Parletta, Andrea Paudice, and Saverio Salzo. An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise, 2025

2025

[18] [18]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InProceedings of the 30th International Conference on Machine Learning, 2013

2013

[19] [19]

Serdar Yüksel and Sean P. Meyn. Random-time, state-dependent stochastic drift for Markov chains and application to stochastic stabilization over erasure channels, 2012. Preprint at arXiv:1010.4820

Pith/arXiv arXiv 2012

[20] [20]

Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

Moslem Zamani and François Glineur. Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025. 21 A Proofs of the Deferred Claims A.1 Proof of Claim 7 Proof. We first prove thatP satisfies the Feller property. Recall that a transition probability kernel Pon the compact metric spaceIis Feller if φ...

2025

[21] [21]

If α > a, then everyx∈ [α, α+ ρ) ∩ X lies strictly to the left ofS, and every relative subgradient g∈∂ X f(x)satisfiesg≤0

[22] [22]

If β < b , then every x∈ (β−ρ, β ] ∩ X lies strictly to the right of S, and every relative subgradientg∈∂ X f(x)satisfiesg≥0. Proof. Since S is a non-empty closed convex subset of the intervalX, it is a closed interval inX. Write u= infS, v= supS. If α > a , then the left endpoint of the enlarged set is not created by the boundary ofX, hence α = u−ρ . The...