In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise
Pith reviewed 2026-06-28 18:34 UTC · model grok-4.3
The pith
Stochastic gradient methods converge in expectation under heavy-tailed noise without bounded domains or changes to the algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the heavy-tailed noise assumption, in which the noise in the stochastic gradient has only a finite p-th moment for some p in (1,2), Stochastic Mirror Descent (SMD) and Accelerated Stochastic Mirror Descent (ASMD) converge in expectation for convex optimization, and Stochastic Gradient Descent (SGD) and Stochastic Gradient Descent with Momentum (SGDM) converge in expectation for nonconvex optimization. These guarantees require no algorithmic modification and no bounded feasible set.
What carries the argument
In-expectation convergence analysis for mirror-descent and momentum updates that closes directly from moment bounds on the noise rather than almost-sure bounds.
If this is right
- SMD converges in expectation on unbounded convex problems under heavy-tailed noise.
- ASMD inherits the same convergence guarantee for convex problems.
- SGD converges in expectation on nonconvex problems under the same noise model.
- SGDM also converges in expectation on nonconvex problems.
- The same moment-based arguments apply uniformly to both convex and nonconvex settings without extra restrictions.
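The SGD and SGDM items above refer to the unmodified updates, with no clipping or normalization. A minimal Python sketch of those updates under synthetic heavy-tailed noise is given here. The noise model (symmetric Lomax with tail index 1.8), the test objective, the step-size schedule, and the particular momentum form are all assumptions made for illustration; this page does not state the paper's exact choices.

```python
import numpy as np

def heavy_tailed_noise(rng, shape, tail_index=1.8):
    """Symmetric Lomax noise (assumed model): zero mean, and the p-th
    moment is finite only for p < tail_index, so the variance is infinite."""
    magnitude = rng.pareto(tail_index, size=shape)
    sign = rng.choice([-1.0, 1.0], size=shape)
    return sign * magnitude

def sgd(grad, x0, steps, lr, rng, momentum=0.0):
    """Unmodified SGD (momentum=0) or one common SGDM form, with no
    clipping and no projection onto a bounded set."""
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x) + heavy_tailed_noise(rng, x.shape)
        m = momentum * m + (1.0 - momentum) * g  # reduces to g when momentum=0
        x = x - lr(t) * m
    return x

def nonconvex_grad(x):
    """Gradient of the smooth nonconvex f(x) = sum(x^2 / (1 + x^2))."""
    return 2.0 * x / (1.0 + x ** 2) ** 2

def step(t):
    """Illustrative step-size schedule, not the paper's."""
    return 0.1 / np.sqrt(t)

rng = np.random.default_rng(0)
x_sgd = sgd(nonconvex_grad, [2.0, -1.0], steps=500, lr=step, rng=rng)
x_sgdm = sgd(nonconvex_grad, [2.0, -1.0], steps=500, lr=step, rng=rng, momentum=0.9)
# Gradient norm at the final iterate, the usual nonconvex progress measure.
print(np.linalg.norm(nonconvex_grad(x_sgd)), np.linalg.norm(nonconvex_grad(x_sgdm)))
```

A single run says little under infinite variance, because one large noise draw can dominate the trajectory; any empirical reading should average over many seeds.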
Where Pith is reading between the lines
- The framework may extend to other first-order methods whose proofs rely on similar expectation recursions.
- Practical heavy-tailed noise in training data could be handled by existing optimizers rather than requiring specialized robust variants.
- Relaxing the moment assumption further to p=1 would test whether the current analysis is tight.
Load-bearing premise
The objective satisfies the convexity or smoothness conditions needed for the mirror-descent or momentum analysis, and the noise satisfies the stated finite-moment bounds.
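For orientation, the premise can be written in its common textbook form. The block below is a sketch of standard definitions, not a quotation from the paper: the symbols $g(x,\xi)$, $\sigma$, and $L$ are introduced here for illustration, and the paper may state its conditions with subgradients, general norms, or other refinements.

```latex
% Convexity (SMD / ASMD setting):
f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle \quad \text{for all } x, y.

% L-smoothness (SGD / SGDM setting):
\|\nabla f(x) - \nabla f(y)\| \le L \, \|x - y\| \quad \text{for all } x, y.

% Heavy-tailed noise: unbiased stochastic gradient, finite p-th moment, p \in (1,2):
\mathbb{E}\left[ g(x,\xi) \right] = \nabla f(x), \qquad
\mathbb{E}\left[ \| g(x,\xi) - \nabla f(x) \|^{p} \right] \le \sigma^{p}.
```

With p = 2 the last condition is the usual bounded-variance assumption; for p < 2 the variance may be infinite.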
What would settle it
A convex problem with heavy-tailed gradient noise of moment order 1.5 on which the expected suboptimality of SMD fails to decrease to zero.
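One way to probe for such a counterexample numerically is a Monte Carlo estimate of the expected suboptimality at several horizons. The sketch below is only an assumption-laden illustration: it uses the Euclidean mirror map (so the SMD step coincides with the SGD step), the objective f(x) = ||x||_1 on an unbounded domain, symmetric Lomax noise with tail index 1.6 (finite 1.5-th moment, infinite variance), and a heuristic constant step size. None of these choices come from the paper, and a simulation can suggest but not settle the question.

```python
import numpy as np

def mean_suboptimality(steps, runs=200, dim=5, tail_index=1.6,
                       moment_order=1.5, seed=0):
    """Monte Carlo estimate of E[f(x_bar) - f*] for f(x) = ||x||_1, where f* = 0."""
    rng = np.random.default_rng(seed)
    x = np.ones((runs, dim))                   # each row is one independent run
    x_sum = np.zeros_like(x)
    lr = 0.5 * steps ** (-1.0 / moment_order)  # heuristic constant step (assumption)
    for _ in range(steps):
        # Symmetric Lomax noise: zero mean, p-th moment finite only for p < tail_index.
        sign = rng.choice([-1.0, 1.0], size=x.shape)
        noise = sign * rng.pareto(tail_index, size=x.shape)
        # Euclidean mirror map: the SMD step is the plain SGD step.
        x = x - lr * (np.sign(x) + noise)
        x_sum += x
    x_bar = x_sum / steps                      # averaged iterate of each run
    return float(np.mean(np.abs(x_bar).sum(axis=1)))

for horizon in (10, 100, 1000):
    print(horizon, mean_suboptimality(horizon))
```

Sample means of heavy-tailed quantities converge slowly, so the printed estimates should be read with wide error bars and repeated across seeds.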
Original abstract
Many stochastic gradient methods are believed not to converge when the noise in stochastic gradients has only a finite $p$-th moment for $p\in\left(1,2\right)$, a setting known as the heavy-tailed noise assumption. However, some recent studies have found that Stochastic Gradient Descent ($\textsf{SGD}$), without any modification to its update rule, can surprisingly converge in expectation for convex problems with bounded domains, highlighting the potential of classical stochastic gradient methods. Inspired by this recent progress, we provide a comprehensive study of stochastic optimization under heavy-tailed noise and establish new in-expectation convergence results for Stochastic Mirror Descent ($\textsf{SMD}$) and Accelerated Stochastic Mirror Descent ($\textsf{ASMD}$) in convex optimization, and for $\textsf{SGD}$ and Stochastic Gradient Descent with Momentum ($\textsf{SGDM}$) in nonconvex optimization. Notably, our results not only hold without algorithmic changes but also avoid restrictive assumptions, such as bounded domains, imposed in prior work. More importantly, our analysis provides a new, elegant, and powerful framework for studying heavy-tailed stochastic optimization, opening a new route to understanding first-order stochastic gradient methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims new in-expectation convergence guarantees for Stochastic Mirror Descent (SMD) and Accelerated SMD (ASMD) in convex optimization, and for SGD and SGDM in nonconvex optimization, when stochastic gradients have only finite p-th moments for p ∈ (1,2). The results are obtained without algorithmic modifications and without the bounded-domain assumptions that appeared in prior work; a new analysis framework handles the heavy-tailed case via direct expectation bounds.
Significance. If the stated conditions and derivations hold, the contribution is significant: it removes a restrictive bounded-domain hypothesis while retaining standard first-order methods, thereby widening the set of settings in which convergence in expectation is provable. The proposed framework is presented as a reusable tool for heavy-tailed analyses, and the referee credits it with avoiding post-hoc restrictions and circular parameter definitions.
minor comments (2)
- [Abstract] Abstract: the precise moment index p and the exact regularity conditions (e.g., L-smoothness or strong convexity parameters) are invoked but not enumerated; adding one sentence listing them would improve immediate readability without altering the technical content.
- [Section 3] Notation: the definition of the mirror map and its associated Bregman divergence should be recalled in the statement of the main theorems (rather than only in the preliminaries) so that the dependence on the geometry is transparent.
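To make the geometry dependence mentioned in the second comment concrete, here is a small Python sketch of the standard Bregman divergence and an unconstrained mirror step. The function names and the two example mirror maps are illustrative assumptions; the page does not say which mirror maps the paper uses.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad_psi(y), x - y>."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

def mirror_step(grad_psi, grad_psi_inv, x, g, lr):
    """Unconstrained mirror step: grad_psi(x_next) = grad_psi(x) - lr * g."""
    return grad_psi_inv(grad_psi(x) - lr * g)

# Euclidean mirror map: D is half the squared distance, and the step is SGD.
def sq(x):
    return 0.5 * np.dot(x, x)

def sq_grad(x):
    return x

def sq_grad_inv(z):
    return z

# Negative entropy on the positive orthant: D is the generalized KL divergence,
# and the step is a multiplicative (exponentiated-gradient style) update.
def negent(x):
    return np.sum(x * np.log(x))

def negent_grad(x):
    return np.log(x) + 1.0

def negent_grad_inv(z):
    return np.exp(z - 1.0)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 4.0])
g = np.array([0.5, -1.0, 0.0])
print(bregman(sq, sq_grad, x, y), bregman(negent, negent_grad, x, y))
print(mirror_step(sq_grad, sq_grad_inv, x, g, 0.1))
print(mirror_step(negent_grad, negent_grad_inv, x, g, 0.1))
```

Swapping the mirror map changes both the update and the distance in which progress is measured, which is why restating it next to each theorem helps the reader.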
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the manuscript and for recommending acceptance. We are pleased that the contribution, namely new in-expectation convergence results for SMD, ASMD, SGD, and SGDM under heavy-tailed noise without bounded-domain assumptions or algorithmic modifications, is viewed as significant.
Circularity Check
No significant circularity; derivation relies on standard assumptions and direct bounds
full rationale
The manuscript presents convergence proofs for SMD/ASMD (convex) and SGD/SGDM (nonconvex) under finite p-moment noise (p in (1,2)). The required conditions—convexity or L-smoothness plus explicit noise-moment bounds—are stated explicitly at the outset and are the standard regularity conditions for the respective mirror-descent and momentum analyses. These assumptions are not defined in terms of the target convergence rates, nor are any parameters fitted to data and then relabeled as predictions. No load-bearing self-citation chain appears; the framework uses direct expectation recursions rather than ansatzes imported from prior author work or uniqueness theorems. The claims therefore remain independent of their own outputs and do not reduce by construction to the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the objective functions satisfy convexity (for SMD/ASMD) or appropriate smoothness (for SGD/SGDM), and the stochastic gradients satisfy the finite p-th moment condition for p in (1,2).