pith. sign in

arxiv: 2606.00520 · v1 · pith:IL642OS3new · submitted 2026-05-30 · 🧮 math.OC · cs.LG· stat.ML

In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise

Pith reviewed 2026-06-28 18:34 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords stochastic gradient descentheavy-tailed noiseconvergence in expectationmirror descentconvex optimizationnonconvex optimizationmomentum methods
0
0 comments X

The pith

Stochastic gradient methods converge in expectation under heavy-tailed noise without bounded domains or changes to the algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard stochastic optimization algorithms continue to converge in expectation even when the noise in the gradients has only a finite moment of order p where p lies between 1 and 2. It establishes this for stochastic mirror descent and its accelerated version on convex problems, and for plain SGD and momentum SGD on nonconvex problems. The results remove the bounded-domain restriction that appeared in earlier positive findings and supply a unified analysis framework that applies without modifying the update rules themselves.

Core claim

Under the heavy-tailed noise assumption that the stochastic gradient has finite p-th moment for p in (1,2), Stochastic Mirror Descent and Accelerated Stochastic Mirror Descent converge in expectation for convex optimization, while SGD and Stochastic Gradient Descent with Momentum converge in expectation for nonconvex optimization; these guarantees hold without any algorithmic modification and without requiring bounded feasible sets.

What carries the argument

In-expectation convergence analysis for mirror-descent and momentum updates that closes directly from moment bounds on the noise rather than almost-sure bounds.

If this is right

  • SMD converges in expectation on unbounded convex problems under heavy-tailed noise.
  • ASMD inherits the same convergence guarantee for convex problems.
  • SGD converges in expectation on nonconvex problems under the same noise model.
  • SGDM also converges in expectation on nonconvex problems.
  • The same moment-based arguments apply uniformly to both convex and nonconvex settings without extra restrictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework may extend to other first-order methods whose proofs rely on similar expectation recursions.
  • Practical heavy-tailed noise in training data could be handled by existing optimizers rather than requiring specialized robust variants.
  • Relaxing the moment assumption further to p=1 would test whether the current analysis is tight.

Load-bearing premise

The objective satisfies the convexity or smoothness conditions needed for the mirror-descent or momentum analysis, and the noise satisfies the stated finite-moment bounds.

What would settle it

A convex problem with heavy-tailed gradient noise of moment order 1.5 on which the expected suboptimality of SMD fails to decrease to zero.

read the original abstract

Many stochastic gradient methods are believed not to converge when the noise in stochastic gradients has only a finite $p$-th moment for $p\in\left(1,2\right)$, a setting known as the heavy-tailed noise assumption. However, some recent studies have found that Stochastic Gradient Descent ($\textsf{SGD}$), without any modification to its update rule, can surprisingly converge in expectation for convex problems with bounded domains, highlighting the potential of classical stochastic gradient methods. Inspired by this recent progress, we provide a comprehensive study of stochastic optimization under heavy-tailed noise and establish new in-expectation convergence results for Stochastic Mirror Descent ($\textsf{SMD}$) and Accelerated Stochastic Mirror Descent ($\textsf{ASMD}$) in convex optimization, and for $\textsf{SGD}$ and Stochastic Gradient Descent with Momentum ($\textsf{SGDM}$) in nonconvex optimization. Notably, our results not only hold without algorithmic changes but also avoid restrictive assumptions, such as bounded domains, imposed in prior work. More importantly, our analysis provides a new, elegant, and powerful framework for studying heavy-tailed stochastic optimization, opening a new route to understanding first-order stochastic gradient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims to establish new in-expectation convergence guarantees for Stochastic Mirror Descent (SMD) and Accelerated SMD under convex optimization, and for SGD and SGDM under nonconvex optimization, when stochastic gradients have only finite p-th moments for p ∈ (1,2). The results are obtained without algorithmic modifications and without imposing bounded-domain assumptions that appeared in prior work; a new analysis framework is introduced to handle the heavy-tailed case via direct expectation bounds.

Significance. If the stated conditions and derivations hold, the contribution is significant: it removes a restrictive bounded-domain hypothesis while retaining standard first-order methods, thereby widening the set of noise distributions for which convergence in expectation is provable. The proposed framework is presented as a reusable tool for heavy-tailed analyses and receives explicit credit for avoiding post-hoc restrictions or circular parameter definitions.

minor comments (2)
  1. [Abstract] Abstract: the precise moment index p and the exact regularity conditions (e.g., L-smoothness or strong convexity parameters) are invoked but not enumerated; adding one sentence listing them would improve immediate readability without altering the technical content.
  2. [Section 3] Notation: the definition of the mirror map and its associated Bregman divergence should be recalled in the statement of the main theorems (rather than only in the preliminaries) so that the dependence on the geometry is transparent.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript and for recommending acceptance. We are pleased that the contribution—new in-expectation convergence results for SMD, ASMD, SGD, and SGDM under heavy-tailed noise without bounded-domain assumptions or algorithmic modifications—is viewed as significant.

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard assumptions and direct bounds

full rationale

The manuscript presents convergence proofs for SMD/ASMD (convex) and SGD/SGDM (nonconvex) under finite p-moment noise (p in (1,2)). The required conditions—convexity or L-smoothness plus explicit noise-moment bounds—are stated explicitly at the outset and are the standard regularity conditions for the respective mirror-descent and momentum analyses. These assumptions are not defined in terms of the target convergence rates, nor are any parameters fitted to data and then relabeled as predictions. No load-bearing self-citation chain appears; the framework uses direct expectation recursions rather than ansatzes imported from prior author work or uniqueness theorems. The claims therefore remain independent of their own outputs and do not reduce by construction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, invented entities, or ad-hoc axioms are stated. Standard domain assumptions on convexity/smoothness and noise moments are implicitly required but not enumerated.

axioms (1)
  • domain assumption The objective functions satisfy convexity (for SMD/ASMD) or appropriate smoothness (for SGD/SGDM) together with the finite p-moment condition on stochastic gradients for p in (1,2).
    These are the minimal conditions needed to state the claimed convergence results; they are invoked by the choice of convex versus nonconvex regimes in the abstract.

pith-pipeline@v0.9.1-grok · 5730 in / 1523 out tokens · 20675 ms · 2026-06-28T18:34:41.699181+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Lower bounds for non-convex stochastic optimization

    Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming , 199(1-2):165--214, 2023

  2. [2]

    Linear attention is (maybe) all you need (to understand transformer optimization)

    Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, and Suvrit Sra. Linear attention is (maybe) all you need (to understand transformer optimization). In The Twelfth International Conference on Learning Representations , 2024

  3. [3]

    Uniformly convex and uniformly smooth convex functions

    Dominique Az\'e and Jean-Paul Penot. Uniformly convex and uniformly smooth convex functions. Annales de la Facult\'e des sciences de Toulouse : Math\'ematiques , Ser. 6, 4(4):705--730, 1995

  4. [4]

    High-probability convergence bounds for online nonlinear stochastic gradient descent under heavy-tailed noise

    Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. High-probability convergence bounds for online nonlinear stochastic gradient descent under heavy-tailed noise. In Yingzhen Li, Stephan Mandt, Shipra Agrawal, and Emtiyaz Khan, editors, Proceedings of The 28th International Conference on Artificial...

  5. [5]

    On linear convergence of non-euclidean gradient methods without strong convexity and lipschitz gradient continuity

    Heinz H Bauschke, J \'e r \^o me Bolte, Jiawei Chen, Marc Teboulle, and Xianfu Wang. On linear convergence of non-euclidean gradient methods without strong convexity and lipschitz gradient continuity. Journal of Optimization Theory and Applications , 182(3):1068--1087, 2019

  6. [6]

    Bauschke, J\' e r\^ o me Bolte, and Marc Teboulle

    Heinz H. Bauschke, J\' e r\^ o me Bolte, and Marc Teboulle. A descent lemma beyond lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research , 42(2):330--348, 2017

  7. [7]

    Curtis, and Jorge Nocedal

    L\' e on Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review , 60(2):223--311, 2018

  8. [8]

    Mirror descent and nonlinear projected subgradient methods for convex optimization

    Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters , 31(3):167--175, 2003

  9. [9]

    Revisiting the noise model of stochastic gradient descent

    Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics , volume 238 of Proceedings of Machine Learning Research , pages 4780--4788. PMLR, 02--04 May 2024

  10. [10]

    High-probability bounds for non-convex stochastic optimization with heavy tails

    Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems , 34:4883--4895, 2021

  11. [11]

    Composite objective mirror descent

    John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT , volume 10, pages 14--26. Citeseer, 2010

  12. [12]

    Optimal complexity and certification of bregman first-order methods

    Radu-Alexandru Dragomir, Adrien B Taylor, Alexandre d’Aspremont, and J \'e r \^o me Bolte. Optimal complexity and certification of bregman first-order methods. Mathematical Programming , 194(1):41--83, 2022

  13. [13]

    Can sgd handle heavy-tailed noise? arXiv preprint arXiv:2508.04860 , 2025

    Ilyas Fatkhullin, Florian H \"u bler, and Guanghui Lan. Can sgd handle heavy-tailed noise? arXiv preprint arXiv:2508.04860 , 2025

  14. [14]

    A study of condition numbers for first-order optimization

    Charles Guille-Escuret, Manuela Girotti, Baptiste Goujaud, and Ioannis Mitliagkas. A study of condition numbers for first-order optimization. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics , volume 130 of Proceedings of Machine Learning Research , pages 1261--1269...

  15. [15]

    Global convergence of the heavy-ball method for convex optimization

    Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC) , pages 310--315, 2015

  16. [16]

    Stochastic first- and zeroth-order methods for nonconvex stochastic programming

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341--2368, 2013

  17. [17]

    A unified framework for bregman proximal methods: subgradient, gradient, and accelerated gradient schemes

    David H Gutman and Javier F Pena. A unified framework for bregman proximal methods: subgradient, gradient, and accelerated gradient schemes. arXiv preprint arXiv:1812.10198 , 2018

  18. [18]

    High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise

    Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horv\' a th, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richt\' a rik. High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian We...

  19. [19]

    On proximal policy optimization's heavy-tailed gradients

    Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Ravikumar. On proximal policy optimization's heavy-tailed gradients. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning , volume 139 of Proceedings ...

  20. [20]

    From gradient clipping to normalization for heavy tailed sgd

    Florian H \"u bler, Ilyas Fatkhullin, and Niao He. From gradient clipping to normalization for heavy tailed sgd. In Yingzhen Li, Stephan Mandt, Shipra Agrawal, and Emtiyaz Khan, editors, Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , volume 258 of Proceedings of Machine Learning Research , pages 2413--2421. PM...

  21. [21]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014

  22. [22]

    Krzysztof C. Kiwiel. Proximal minimization methods with generalized bregman functions. SIAM Journal on Control and Optimization , 35(4):1142--1168, 1997

  23. [23]

    An optimal method for stochastic composite optimization

    Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming , 133(1):365--397, 2012

  24. [24]

    First-order and stochastic optimization methods for machine learning

    Guanghui Lan. First-order and stochastic optimization methods for machine learning . Springer, 2020

  25. [25]

    Freund, and Yurii Nesterov

    Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization , 28(1):333--354, 2018

  26. [26]

    An improved analysis of stochastic gradient descent with momentum

    Yanli Liu, Yuan Gao, and Wotao Yin. An improved analysis of stochastic gradient descent with momentum. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems , volume 33, pages 18261--18271. Curran Associates, Inc., 2020

  27. [27]

    Online convex optimization with heavy tails: Old algorithms, new regrets, and applications

    Zijian Liu. Online convex optimization with heavy tails: Old algorithms, new regrets, and applications. arXiv preprint arXiv:2508.07473 , 2025

  28. [28]

    Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

    Zijian Liu. Can adaptive gradient methods converge under heavy-tailed noise? a case study of adagrad. arXiv preprint arXiv:2605.18694 , 2026

  29. [29]

    Clipped gradient methods for nonsmooth convex optimization under heavy-tailed noise: A refined analysis

    Zijian Liu. Clipped gradient methods for nonsmooth convex optimization under heavy-tailed noise: A refined analysis. In The Fourteenth International Conference on Learning Representations , 2026

  30. [30]

    relative continuity

    Haihao Lu. “relative continuity” for non-lipschitz nonsmooth convex optimization using stochastic (or deterministic) mirror descent. INFORMS Journal on Optimization , 1(4):288--303, 2019

  31. [31]

    High-probability bound for non-smooth non-convex stochastic optimization with heavy tails

    Langqi Liu, Yibo Wang, and Lijun Zhang. High-probability bound for non-smooth non-convex stochastic optimization with heavy tails. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning , volume 235 of Procee...

  32. [32]

    Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation.arXiv preprint arXiv:2303.12277, 2023

    Zijian Liu and Zhengyuan Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation. arXiv preprint arXiv:2303.12277 , 2023

  33. [33]

    Revisiting the last-iterate convergence of stochastic gradient methods

    Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations , 2024

  34. [34]

    Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping

    Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. In The Thirteenth International Conference on Learning Representations , 2025

  35. [35]

    Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise

    Zijian Liu, Jiawei Zhang, and Zhengyuan Zhou. Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory , volume 195 of Proceedings of Machine Learning Research , pages 2266--2290. PMLR,...

  36. [36]

    Minimization methods for nonsmooth convex and quasiconvex functions

    Yurii E Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon , 29(3):519--531, 1984

  37. [37]

    Improved convergence in high probability of clipped gradient methods with heavy tailed noise

    Ta Duy Nguyen, Thien H Nguyen, Alina Ene, and Huy Nguyen. Improved convergence in high probability of clipped gradient methods with heavy tailed noise. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems , volume 36, pages 24191--24222. Curran Associates, Inc., 2023

  38. [38]

    Linear convergence of first order methods for non-strongly convex optimization

    Ion Necoara, Yu Nesterov, and Francois Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical programming , 175(1):69--107, 2019

  39. [39]

    Problem complexity and method efficiency in optimization

    Arkadi Nemirovski and David Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience , 1983

  40. [40]

    Online Learning: A Modern Introduction Using Convex Optimization

    Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213 , 2019

  41. [41]

    Breaking the heavy-tailed noise barrier in stochastic optimization problems

    Nikita Puchkin, Eduard Gorbunov, Nickolay Kutuzov, and Alexander Gasnikov. Breaking the heavy-tailed noise barrier in stochastic optimization problems. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics , volume 238 of Proceedings of Machine Learning Resea...

  42. [42]

    Best possible bounds of the von Bahr--Esseen type

    Iosif Pinelis. Best possible bounds of the von Bahr--Esseen type . Annals of Functional Analysis , 6(4):1 -- 29, 2015

  43. [43]

    On the difficulty of training recurrent neural networks

    Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning , volume 28 of Proceedings of Machine Learning Research , pages 1310--1318, Atlanta, Georgia, USA, 17--19 Jun 2013. PMLR

  44. [44]

    B.T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics , 3(4):864--878, 1963

  45. [45]

    B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics , 4(5):1--17, 1964

  46. [46]

    Boris T. Polyak. Introduction to optimization . New York, Optimization Software, 1987

  47. [47]

    An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise

    Daniela Angela Parletta, Andrea Paudice, and Saverio Salzo. An improved analysis of the clipped stochastic subgradient method under heavy-tailed noise. arXiv preprint arXiv:2410.00573 , 2024

  48. [48]

    A Stochastic Approximation Method

    Herbert Robbins and Sutton Monro. A Stochastic Approximation Method . The Annals of Mathematical Statistics , 22(3):400 -- 407, 1951

  49. [49]

    High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

    Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horv\' a th, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richt\' a rik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and ...

  50. [50]

    Revisiting gradient normalization and clipping for nonconvex sgd under heavy-tailed noise: Necessity, sufficiency, and acceleration

    Tao Sun, Xinwang Liu, and Kun Yuan. Revisiting gradient normalization and clipping for nonconvex sgd under heavy-tailed noise: Necessity, sufficiency, and acceleration. Journal of Machine Learning Research , 26(237):1--42, 2025

  51. [51]

    A tail-index analysis of stochastic gradient noise in deep neural networks

    Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning , volume 97 of Proceedings of Machine Learning Research , pages 5827--5837. PMLR, 09--15 Jun 2019

  52. [52]

    Inequalities for the rth absolute moment of a sum of random variables, 1 r 2

    Bengt von Bahr and Carl-Gustav Esseen. Inequalities for the rth absolute moment of a sum of random variables, 1 r 2 . The Annals of Mathematical Statistics , 36(1):299--303, 1965

  53. [53]

    Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance

    Nuri Mert Vural, Lu Yu, Krishna Balasubramanian, Stanislav Volgushev, and Murat A Erdogdu. Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory , volume 178 of Proceedings of Machine Learning Research , pages...

  54. [54]

    Closing the gap between the upper bound and lower bound of adam s iteration complexity

    Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap between the upper bound and lower bound of adam s iteration complexity. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems , volume 36, pages 39006--39032. Curran Associates, Inc., 2023

  55. [55]

    Convergence rates of stochastic gradient descent under infinite noise variance

    Hongjian Wang, Mert Gurbuzbalaban, Lingjiong Zhu, Umut Simsekli, and Murat A Erdogdu. Convergence rates of stochastic gradient descent under infinite noise variance. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems , volume 34, pages 18866--18877. Curran Associates, I...

  56. [56]

    On the lower bound of minimizing polyak-Łojasiewicz functions

    Pengyun Yue, Cong Fang, and Zhouchen Lin. On the lower bound of minimizing polyak-Łojasiewicz functions. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory , volume 195 of Proceedings of Machine Learning Research , pages 2948--2968. PMLR, 12--15 Jul 2023

  57. [57]

    Parameter-free regret in high probability with heavy tails

    Jiujia Zhang and Ashok Cutkosky. Parameter-free regret in high probability with heavy tails. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems , volume 35, pages 8000--8012. Curran Associates, Inc., 2022

  58. [58]

    Proximal-like incremental aggregated gradient method with linear convergence under bregman distance growth conditions

    Hui Zhang, Yu-Hong Dai, Lei Guo, and Wei Peng. Proximal-like incremental aggregated gradient method with linear convergence under bregman distance growth conditions. Mathematics of Operations Research , 46(1):61--81, 2021

  59. [59]

    Exact convergence rate of the last iterate in subgradient methods

    Moslem Zamani and Fran c ois Glineur. Exact convergence rate of the last iterate in subgradient methods. SIAM Journal on Optimization , 35(3):2182--2201, 2025

  60. [60]

    Why are adaptive methods good for attention models? In H

    Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems , volume 33, pages 15383--15393. Curran Associates, Inc., 2020

  61. [61]

    Regret bounds without lipschitz continuity: Online learning with relative-lipschitz losses

    Yihan Zhou, Victor Sanches Portella, Mark Schmidt, and Nicholas Harvey. Regret bounds without lipschitz continuity: Online learning with relative-lipschitz losses. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems , volume 33, pages 15823--15833. Curran Associates, Inc., 2020

  62. [62]

    Zǎlinescu

    C. Zǎlinescu. On uniformly convex functions. Journal of Mathematical Analysis and Applications , 95(2):344--374, 1983