New Bounds for the Last Iterate of the Stochastic subGradient Method
Pith reviewed 2026-06-25 22:41 UTC · model grok-4.3
The pith
Under i.i.d. additive subgradient noise with bounded variance, the last iterate of stochastic subgradient descent reaches order 1/√n error in one dimension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For fixed horizon n and stepsize η = Θ(1/√n), when subgradient noise is additive, i.i.d., and has uniformly bounded variance, the optimization error of the last iterate is of order 1/√n for one-dimensional convex Lipschitz objectives. Dropping the i.i.d. assumption permits error of order (log n)/√n, showing that the last iterate is suboptimal under bounded variance alone.
What carries the argument
Last-iterate error analysis that exploits one-dimensional convexity together with the additive i.i.d. structure of the noise.
If this is right
- The last iterate matches the known rate of the averaged iterate when the noise satisfies the i.i.d. condition.
- Existing generic bounds that carry an extra log n factor are not tight under the i.i.d. assumption.
- The log n factor reappears as soon as the noise process loses the i.i.d. property, even in dimension one.
- The result separates the effect of noise correlation from the effect of dimension.
Where Pith is reading between the lines
- Analyses of other first-order stochastic methods may improve when they explicitly use i.i.d. structure rather than only variance bounds.
- The distinction between i.i.d. and merely bounded-variance noise could matter for last-iterate guarantees in non-convex or higher-dimensional settings.
- Practical implementations that average iterates might be replaceable by the raw last iterate if the noise is known to be i.i.d.
Load-bearing premise
The subgradient noise must be additive, independent across steps, identically distributed, and have uniformly bounded variance.
What would settle it
An explicit one-dimensional convex Lipschitz function and a sequence of i.i.d. bounded-variance noises for which the last iterate error exceeds order 1/√n would falsify the positive bound.
read the original abstract
We study the last iterate of the stochastic subgradient method for one-dimensional convex Lipschitz objectives. For a fixed horizon $n$, we consider the standard fixed stepsizes $\eta =\Theta(1/\sqrt n)$. We prove that, for such stepsize policies, under additive i.i.d. subgradient noise with uniformly bounded variance, the last iterate features an optimization error of order $1/\sqrt n$, thereby removing the extra $(\log n)$ factor present in existing generic bounds. On the other hand, we show that without the i.i.d. assumption, the optimization error can be of order $(\log n)/\sqrt n$. Thus, under the uniformly bounded variance assumption alone, the last iterate of SsGM is suboptimal even in dimension one, resolving negatively an open problem posed in Koren and Segal, COLT, 2020.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the last iterate of the stochastic subgradient method for one-dimensional convex Lipschitz objectives. With fixed stepsizes η=Θ(1/√n), it proves that additive i.i.d. subgradient noise with uniformly bounded variance yields last-iterate optimization error of order 1/√n (removing the extra log n factor from generic bounds). Without the i.i.d. assumption the error can reach (log n)/√n, negatively resolving an open problem from Koren and Segal (COLT 2020).
Significance. If the derivations hold, the result is significant for clarifying the precise role of the i.i.d. assumption in last-iterate rates for stochastic subgradient methods, even in dimension one. It supplies both a positive rate under the stronger noise model and an explicit counter-example construction under uniform bounded variance alone, directly addressing a stated open question.
minor comments (2)
- The abstract states the positive and negative results cleanly but does not indicate the theorem numbers or sections where the i.i.d. positive bound and the non-i.i.d. construction appear; adding these cross-references would aid readers.
- Notation for the noise process (additive i.i.d. with bounded variance) is introduced in the abstract; a short dedicated paragraph in §2 or §3 defining the precise probability space and moment assumptions would improve readability.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of our results, and recommendation of minor revision. No major comments were raised in the report.
Circularity Check
No significant circularity
full rationale
The derivation proceeds from the standard subgradient recursion under the paper's explicit assumptions (additive i.i.d. noise with uniformly bounded variance for the positive O(1/√n) result, and the same variance bound without i.i.d. for the negative (log n)/√n counter-example). No step reduces the claimed rate to a quantity defined by the same paper, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain. The separation of results follows directly from the differing noise conditions applied to the same recursion, making the analysis self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Objective is convex and Lipschitz continuous in one dimension
- domain assumption Subgradient noise is additive with uniformly bounded variance
Reference graph
Works this paper leans on
-
[1]
Bartlett, Pradeep Ravikumar, and Martin J
Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information- theoretic lower bounds on the oracle complexity of stochastic convex optimization.IEEE Transactions on Information Theory, 58(5):3235–3249, 2012
2012
-
[2]
MIT press, 2024
Francis Bach.Learning theory from first principles. MIT press, 2024
2024
-
[3]
General tail bounds for non-smooth stochastic mirror descent
Khaled Eldowa and Andrea Paudice. General tail bounds for non-smooth stochastic mirror descent. InProceedings of the 27-th International Conference on Artificial Intelligence and Statistics, pages 3205–3213, 2024
2024
-
[4]
Yu. M. Ermol’ev. On the method of generalized stochastic gradients and quasi-Féjer sequences. Cybernetics, 5:208–220, 1969
1969
-
[5]
Nicholas J. A. Harvey, Chris Liaw, and Sikander Randhawa. Tight analyses for subgradient descent I: lower bounds.Open J. Math. Optim., 5:1–17, 2024
2024
-
[6]
Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa
Nicholas J.A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InProceedinds of the 32nd International Conference on Computational Learning Theory, pages 1579–1613, 2019. 20
2019
-
[7]
Nagaraj, and Praneeth Netrapalli
Prateek Jain, Dheeraj M. Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal.SIAM Journal on Optimization, 31(2):1108–1130, 2021
2021
-
[8]
Springer, 3 edition, 2021
Olav Kallenberg.Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling. Springer, 3 edition, 2021
2021
-
[9]
Open problem: Tight convergence of SGD in constant dimension
Tomer Koren and Shahar Segal. Open problem: Tight convergence of SGD in constant dimension. InCOLT 2020, volume 125 ofProceedings of Machine Learning Research, pages 3847–3851, 2020
2020
-
[10]
Gradient descent’s last iterate is often (slightly) suboptimal
Guy Kornowski and Ohad Shamir. Gradient descent’s last iterate is often (slightly) suboptimal. Preprint at arXiv:2604.13870, April 2026
Pith/arXiv arXiv 2026
-
[11]
The convergence rate of SGD’s final iterate: Analysis on dimension dependence, 2021
Daogao Liu and Zhou Lu. The convergence rate of SGD’s final iterate: Analysis on dimension dependence, 2021. Preprint arXiv:2106.14588
arXiv 2021
-
[12]
Zijian Liu and Zhengyuan Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation.arXiv preprint arXiv:2303.12277, 2023
arXiv 2023
-
[13]
Revisiting the last-iterate convergence of stochastic gradient methods
Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. InProceedings of the 12-th International Conference on Learning Representations (to appear), 2024
2024
-
[14]
Meyn and Richard L
Sean P. Meyn and Richard L. Tweedie.Markov Chains and Stochastic Stability. Cambridge University Press, 2 edition, 2009
2009
-
[15]
Nemirovski, A
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4):1574–1609, 2009
2009
-
[16]
Wiley-Interscience, 1983
Arkadij Semenovič Nemirovskij and David Borisovich Yudin.Problem complexity and method efficiency in optimization. Wiley-Interscience, 1983
1983
-
[17]
An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise, 2025
Daniela Angela Parletta, Andrea Paudice, and Saverio Salzo. An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise, 2025
2025
-
[18]
Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InProceedings of the 30th International Conference on Machine Learning, 2013
2013
-
[19]
Serdar Yüksel and Sean P. Meyn. Random-time, state-dependent stochastic drift for Markov chains and application to stochastic stabilization over erasure channels, 2012. Preprint at arXiv:1010.4820
Pith/arXiv arXiv 2012
-
[20]
Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025
Moslem Zamani and François Glineur. Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025. 21 A Proofs of the Deferred Claims A.1 Proof of Claim 7 Proof. We first prove thatP satisfies the Feller property. Recall that a transition probability kernel Pon the compact metric spaceIis Feller if φ...
2025
-
[21]
If α > a, then everyx∈ [α, α+ ρ) ∩ X lies strictly to the left ofS, and every relative subgradient g∈∂ X f(x)satisfiesg≤0
-
[22]
If β < b , then every x∈ (β−ρ, β ] ∩ X lies strictly to the right of S, and every relative subgradientg∈∂ X f(x)satisfiesg≥0. Proof. Since S is a non-empty closed convex subset of the intervalX, it is a closed interval inX. Write u= infS, v= supS. If α > a , then the left endpoint of the enlarged set is not created by the boundary ofX, hence α = u−ρ . The...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.