The Dual Averaging Power-Prox Method with Application to Heavy-Tail Incremental Gradient

Jeremy Rack; Sebastian U. Stich; Yuan Gao

arxiv: 2606.10110 · v1 · pith:KOGI6QXEnew · submitted 2026-06-08 · 🧮 math.OC

The Dual Averaging Power-Prox Method with Application to Heavy-Tail Incremental Gradient

Yuan Gao , Jeremy Rack , Sebastian U. Stich This is my paper

Pith reviewed 2026-06-27 15:16 UTC · model grok-4.3

classification 🧮 math.OC

keywords finite-sum optimizationincremental gradientsheavy-tailed noisedual averagingconvergence ratesstochastic gradient descentcomposite optimizationwithout-replacement sampling

0 comments

The pith

The dual averaging power-prox method achieves faster asymptotic convergence than i.i.d. SGD under cyclic incremental gradients with heavy-tailed noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper focuses on finite-sum composite optimization when component gradients are accessed in fixed cyclic order and exhibit heavy-tailed noise with only a bounded q-th centralized moment at the optimum, for q between 1 and 2. This setup departs from classical SGD assumptions yet matches many practical training regimes. The authors introduce the Dual Averaging Power-Prox method and supply the first convergence analysis for incremental gradients in this heavy-tailed regime. Their analysis establishes that the new method attains a better asymptotic rate than the corresponding SGD procedure that draws components independently with replacement.

Core claim

Under the assumptions of cyclic (without-replacement) access to component gradients and bounded q-th centralized moments of those gradients at the optimum, the Dual Averaging Power-Prox method is the first algorithm shown to converge, and it does so at a strictly better asymptotic rate than the natural i.i.d. SGD counterpart.

What carries the argument

The Dual Averaging Power-Prox method, which performs dual averaging updates using a power-prox operator tailored to the heavy-tailed incremental setting.

If this is right

Convergence guarantees now exist for a class of incremental methods that classical bounded-variance theory could not cover.
The method improves the asymptotic rate relative to with-replacement SGD while retaining the same per-iteration cost.
Fixed cyclic passes over the data become theoretically preferable to random reshuffling under the stated moment condition.
The analysis supplies explicit dependence of the rate on the moment parameter q.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Cyclic ordering may become advantageous precisely when noise moments are limited, reversing the usual preference for randomization.
The same dual-averaging construction could be tested on other structured sampling patterns such as mini-batch without replacement.
Training loops that already traverse data in fixed order could adopt the power-prox update without changing the data pipeline.

Load-bearing premise

At the optimum the component gradients possess a finite q-th centralized moment for some q in (1,2].

What would settle it

An experiment in which the observed convergence rate of the power-prox method equals or falls below that of i.i.d. SGD on a problem whose component gradients have unbounded second moments at the optimum.

Figures

Figures reproduced from arXiv: 2606.10110 by Jeremy Rack, Sebastian U. Stich, Yuan Gao.

read the original abstract

We study finite-sum composite optimization under two departures from classical stochastic gradient descent theory that are central in practice: incremental gradient access and heavy-tailed gradient noise. Specifically, we consider fixed cyclic passes over component gradients and assume that, at the optimum, component gradients have a bounded $q$-th centralized moment for some $q\in(1,2]$. This setting is much closer to modern ML training practice than the assumptions used in classical SGD theory, yet its theoretical understanding remains limited. We propose a Dual Averaging Power-Prox method for incremental gradients and establish, to the best of our knowledge, the first convergence analysis in this regime. We further show that our method achieves a better asymptotic convergence rate than the corresponding SGD method with i.i.d. (with-replacement) sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first analysis of dual averaging for cyclic incremental gradients with q-moments bounded only at the optimum and claims a better rate than i.i.d. SGD, but the trajectory-control issue for q<2 looks like a real gap.

read the letter

The main new thing is the convergence analysis for this exact mix of fixed cyclic passes and heavy-tailed component gradients where only the q-th centralized moment is bounded at the optimum for q in (1,2]. They introduce a Dual Averaging Power-Prox method and state that it improves the asymptotic rate over the corresponding with-replacement SGD. That rate comparison is presented as a direct consequence of the analysis and is the part that would matter most to people working on non-standard noise models.

The setting itself is closer to practice than classical bounded-variance SGD, so the attempt to handle incremental access plus infinite variance is worth doing. The method choice and the focus on asymptotic rates rather than finite-time bounds are reasonable moves for this regime.

The soft spot is the moment assumption. When q is less than 2 the noise has infinite variance, and bounding the moment only at x* does not automatically keep the sequence inside a region where the bound applies. Large deviations can send iterates outside any fixed neighborhood, which would break both the rate derivation and the claimed advantage over i.i.d. sampling. The abstract gives no indication that an auxiliary argument for trajectory control or neighborhood extension is supplied, so this needs to be checked line-by-line in the proofs. If that control is missing, the central claim does not hold.

The paper is aimed at readers who already work on heavy-tailed or incremental optimization and want to see rates under weaker moment conditions. It is worth a serious referee to verify whether the proofs close the gap the stress test flags; the topic is relevant enough that the derivations should be examined even if revisions are needed.

Referee Report

1 major / 0 minor

Summary. The paper proposes the Dual Averaging Power-Prox method for finite-sum composite optimization under cyclic incremental gradient access and heavy-tailed noise, where component gradients satisfy a bounded q-th centralized moment (q ∈ (1,2]) only at the optimum. It claims to deliver the first convergence analysis in this regime and a strictly better asymptotic rate than the corresponding i.i.d. SGD method.

Significance. If the analysis is complete, the result would supply the first rigorous rates for a practical incremental method under moment assumptions weaker than finite variance, directly relevant to modern ML training. The claimed rate improvement over with-replacement SGD would be a notable theoretical distinction.

major comments (1)

[Abstract (and wherever the main assumption is formalized)] The assumption that component gradients have bounded q-th centralized moment is stated only at the optimum x*. For q ∈ (1,2] the noise has infinite variance; without an auxiliary argument showing that the iterates remain in a controlled neighborhood (or that the moment bound extends to a neighborhood of x*), large deviations can send the sequence outside any region where the bound applies. This directly affects both the rate derivation and the comparison to i.i.d. SGD. The abstract gives no indication that such control is established.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We appreciate the referee's careful reading of our manuscript and the constructive comment regarding the moment assumption. Below we provide our response and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract (and wherever the main assumption is formalized)] The assumption that component gradients have bounded q-th centralized moment is stated only at the optimum x*. For q ∈ (1,2] the noise has infinite variance; without an auxiliary argument showing that the iterates remain in a controlled neighborhood (or that the moment bound extends to a neighborhood of x*), large deviations can send the sequence outside any region where the bound applies. This directly affects both the rate derivation and the comparison to i.i.d. SGD. The abstract gives no indication that such control is established.

Authors: We agree that the q-th moment bound is stated only at x* and that the abstract provides no indication of iterate control. For q ∈ (1,2] this raises a legitimate technical issue, as the analysis requires the bound to apply along the trajectory. The manuscript's convergence proof (Theorem 3.1 and supporting lemmas) proceeds under the stated assumption at x*, but does not contain an explicit auxiliary argument establishing that the dual-averaging proximal updates keep iterates in a suitable neighborhood with high probability. We will therefore add this argument in the revision, update the abstract to note the established control, and insert a clarifying remark after the assumption statement. The same clarification will be applied to the i.i.d. SGD comparison. This constitutes a substantive but contained revision. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained from stated assumptions

full rationale

The paper states an assumption (bounded q-th centralized moment of component gradients at the optimum) and derives convergence rates for the proposed Dual Averaging Power-Prox method under incremental gradient access. The claimed better asymptotic rate versus i.i.d. SGD is presented as a direct consequence of the analysis under this assumption. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The derivation chain relies on standard optimization analysis techniques applied to the given moment condition and does not reduce to its inputs by construction. External benchmarks or machine-checked results are not needed here as the central claims remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full details on any free parameters or additional axioms are unavailable. The central modeling choice is the moment bound.

axioms (1)

domain assumption component gradients have a bounded q-th centralized moment for some q∈(1,2] at the optimum
Explicitly stated as the noise model in the abstract.

pith-pipeline@v0.9.1-grok · 5662 in / 1156 out tokens · 26247 ms · 2026-06-27T15:16:55.975504+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 9 canonical work pages · 4 internal anchors

[1]

Towards Weaker Variance Assumptions for Stochastic Optimization

Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

work page internal anchor Pith review arXiv 2025
[2]

Nonlinear stochastic gradient descent and heavy-tailed noise: A unified framework and high-probability guarantees.arXiv preprint arXiv:2410.13954, 2024

Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Nonlinear stochastic gradient descent and heavy-tailed noise: A unified framework and high-probability guarantees.arXiv preprint arXiv:2410.13954, 2024

work page arXiv 2024
[3]

Optimal high- probability convergence of nonlinear sgd under heavy-tailed noise via symmetrization.arXiv preprint arXiv:2507.09093, 2025

Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Optimal high- probability convergence of nonlinear sgd under heavy-tailed noise via symmetrization.arXiv preprint arXiv:2507.09093, 2025

work page arXiv 2025
[4]

Revisiting the noise model of stochastic gradient descent

Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. InInternational Conference on Artificial Intelligence and Statistics, pages 4780–4788. PMLR, 2024

2024
[5]

Practical recommendations for gradient-based training of deep architectures

Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second edition, pages 437–478. Springer, 2012

2012
[6]

arXiv preprint arXiv:2306.12498 , year=

Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Empirical risk minimization with shuffled sgd: A primal-dual perspective and improved bounds.arXiv preprint arXiv:2306.12498, 2023

work page arXiv 2023
[7]

Clipping improves adam-norm and adagrad-norm when the noise is heavy-tailed

Savelii Chezhegov, Klyukin Yaroslav, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, and Eduard Gorbunov. Clipping improves adam-norm and adagrad-norm when the noise is heavy-tailed. InInternational Conference on Machine Learning, pages 10269–10333. PMLR, 2025

2025
[8]

Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

2023
[9]

Proximal splitting methods in signal processing

Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. InFixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011

2011
[10]

Can sgd handle heavy-tailed noise? InOPT 2025: Optimization for Machine Learning, 2025

Ilyas Fatkhullin, Florian Hübler, and Guanghui Lan. Can sgd handle heavy-tailed noise? InOPT 2025: Optimization for Machine Learning, 2025. 11

2025
[11]

Tight lower bounds and optimal algorithms for stochastic nonconvex optimization with heavy-tailed noise

Adrien Fradin, Abdurakhmon Sadiev, Laurent Condat, and Peter Richtárik. Tight lower bounds and optimal algorithms for stochastic nonconvex optimization with heavy-tailed noise. InThe 29th International Conference on Artificial Intelligence and Statistics, 2026

2026
[12]

Composite optimization with error feedback: the dual averaging approach

Yuan Gao, Anton Rodomanov, Jeremy Rack, and Sebastian U Stich. Composite optimization with error feedback: the dual averaging approach. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=PSmakC4sw5

2026
[13]

On proximal policy optimization’s heavy-tailed gradients

Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Ravikumar. On proximal policy optimization’s heavy-tailed gradients. InInternational Conference on Machine Learning, pages 3610–3619. PMLR, 2021

2021
[14]

Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013

2013
[15]

Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1): 267–305, 2016

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1): 267–305, 2016

2016
[16]

Why random reshuffling beats stochastic gradient descent.Mathematical Programming, 186(1):49–84, 2021

Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent.Mathematical Programming, 186(1):49–84, 2021

2021
[17]

Random shuffling beats sgd after finite epochs

Jeff Haochen and Suvrit Sra. Random shuffling beats sgd after finite epochs. InInternational Conference on Machine Learning, pages 2624–2633. PMLR, 2019

2019
[18]

From gradient clipping to normalization for heavy tailed sgd

Florian Hübler, Ilyas Fatkhullin, and Niao He. From gradient clipping to normalization for heavy tailed sgd. InInternational Conference on Artificial Intelligence and Statistics, pages 2413–2421. PMLR, 2025

2025
[19]

Proximal random reshuffling under local lipschitz continuity.arXiv preprint arXiv:2408.07182, 2024

Cedric Josz, Lexiao Lai, and Xiaopeng Li. Proximal random reshuffling under local lipschitz continuity.arXiv preprint arXiv:2408.07182, 2024

work page arXiv 2024
[20]

Better theory for sgd in the nonconvex world.Transactions on Machine Learning Research, 2020

Ahmed Khaled and Peter Richtárik. Better theory for sgd in the nonconvex world.Transactions on Machine Learning Research, 2020

2020
[21]

On convergence of incremental gradient for non-convex smooth functions

Anastasia Koloskova, Nikita Doikov, Sebastian U Stich, and Martin Jaggi. On convergence of incremental gradient for non-convex smooth functions. InInternational Conference on Machine Learning, pages 25058–25086. PMLR, 2024

2024
[22]

On convergence of incremental gradient for non-convex smooth functions

Anastasia Koloskova, Nikita Doikov, Sebastian U Stich, and Martin Jaggi. On convergence of incremental gradient for non-convex smooth functions. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=ZRMQX6aTUS

2024
[23]

Sparse convolutional neural networks

Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015

2015
[24]

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

Zijian Liu. Clipped gradient methods for nonsmooth convex optimization under heavy-tailed noise: A refined analysis.arXiv preprint arXiv:2512.23178, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

On the last-iterate convergence of shuffling gradient methods

Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In International Conference on Machine Learning, pages 32471–32508. PMLR, 2024. 12

2024
[26]

Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping

Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[27]

Relatively smooth convex optimization by first-order methods, and applications.SIAM Journal on Optimization, 28(1):333–354, 2018

Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications.SIAM Journal on Optimization, 28(1):333–354, 2018

2018
[28]

Russell Luke.Proximal Methods for Image Processing, pages 165–202

D. Russell Luke.Proximal Methods for Image Processing, pages 165–202. Springer International Publishing, Cham, 2020. ISBN 978-3-030-34413-9. doi: 10 .1007/978-3-030-34413-9_6. URL https://doi.org/10.1007/978-3-030-34413-9_6

work page doi:10.1007/978-3-030-34413-9_6 2020
[29]

Random reshuffling: Simple analysis with vast improvements.Advances in Neural Information Processing Systems, 33:17309–17320, 2020

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements.Advances in Neural Information Processing Systems, 33:17309–17320, 2020

2020
[30]

Proximal and federated random reshuffling

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. InInternational Conference on Machine Learning, pages 15718–15749. PMLR, 2022

2022
[31]

Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! InInternational Conference on Machine Learning, pages 15750–15769

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! InInternational Conference on Machine Learning, pages 15750–15769. PMLR, 2022

2022
[32]

Convergence rate of incremental subgradient algorithms

Angelia Nedic and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. Applied Optimization, 54:223–264, 2001

2001
[33]

Problem complexity and method efficiency in optimization

Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983

1983
[34]

A unified convergence analysis for shuffling-type gradient methods.Journal of Machine Learning Research, 22(207):1–44, 2021

Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods.Journal of Machine Learning Research, 22(207):1–44, 2021

2021
[35]

A tail-index analysis of stochastic gradient noise in deep neural networks

Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pages 5827–5837. PMLR, 2019

2019
[36]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance

Nuri Mert Vural, Lu Yu, Krishna Balasubramanian, Stanislav Volgushev, and Murat A Erdogdu. Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance. InConference on Learning Theory, pages 65–102. PMLR, 2022

2022
[38]

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, and Lijun Zhang. Sign-based optimizers are effective under heavy-tailed noise.arXiv preprint arXiv:2602.07425, 2026. A Auxiliary facts and technical lemmas We first present a simple fact regarding the uniform convexity of the power function, which is a key technical tool in our analysis. 13 Fact 1.The function ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Therefore: T−1X t=0 E[F(x t+1)−F ⋆] + γ 2 T−1X t=0 E h ∥xt+1 −x t∥2 i + λ p2p−2 T−1X t=0 E[∥x t+1 −x t∥p] ≤ γR2 0 2 + λRp 0 p + T−1X t=0 E[⟨g t − ∇f(x t),x t −x t+1⟩]. To upper bound the residual error, we use Young’s inequality: E[⟨g t − ∇f(x t),x t −x t+1⟩] =E[⟨∇f it(xt)− ∇f it(x⋆) +∇f(x ⋆)− ∇f(x t),x t −x t+1⟩] +E[⟨∇f it(x⋆)− ∇f(x ⋆),x t −x t+1⟩] (iv) ...

[1] [1]

Towards Weaker Variance Assumptions for Stochastic Optimization

Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

work page internal anchor Pith review arXiv 2025

[2] [2]

Nonlinear stochastic gradient descent and heavy-tailed noise: A unified framework and high-probability guarantees.arXiv preprint arXiv:2410.13954, 2024

Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Nonlinear stochastic gradient descent and heavy-tailed noise: A unified framework and high-probability guarantees.arXiv preprint arXiv:2410.13954, 2024

work page arXiv 2024

[3] [3]

Optimal high- probability convergence of nonlinear sgd under heavy-tailed noise via symmetrization.arXiv preprint arXiv:2507.09093, 2025

Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Optimal high- probability convergence of nonlinear sgd under heavy-tailed noise via symmetrization.arXiv preprint arXiv:2507.09093, 2025

work page arXiv 2025

[4] [4]

Revisiting the noise model of stochastic gradient descent

Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. InInternational Conference on Artificial Intelligence and Statistics, pages 4780–4788. PMLR, 2024

2024

[5] [5]

Practical recommendations for gradient-based training of deep architectures

Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second edition, pages 437–478. Springer, 2012

2012

[6] [6]

arXiv preprint arXiv:2306.12498 , year=

Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Empirical risk minimization with shuffled sgd: A primal-dual perspective and improved bounds.arXiv preprint arXiv:2306.12498, 2023

work page arXiv 2023

[7] [7]

Clipping improves adam-norm and adagrad-norm when the noise is heavy-tailed

Savelii Chezhegov, Klyukin Yaroslav, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, and Eduard Gorbunov. Clipping improves adam-norm and adagrad-norm when the noise is heavy-tailed. InInternational Conference on Machine Learning, pages 10269–10333. PMLR, 2025

2025

[8] [8]

Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

2023

[9] [9]

Proximal splitting methods in signal processing

Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. InFixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011

2011

[10] [10]

Can sgd handle heavy-tailed noise? InOPT 2025: Optimization for Machine Learning, 2025

Ilyas Fatkhullin, Florian Hübler, and Guanghui Lan. Can sgd handle heavy-tailed noise? InOPT 2025: Optimization for Machine Learning, 2025. 11

2025

[11] [11]

Tight lower bounds and optimal algorithms for stochastic nonconvex optimization with heavy-tailed noise

Adrien Fradin, Abdurakhmon Sadiev, Laurent Condat, and Peter Richtárik. Tight lower bounds and optimal algorithms for stochastic nonconvex optimization with heavy-tailed noise. InThe 29th International Conference on Artificial Intelligence and Statistics, 2026

2026

[12] [12]

Composite optimization with error feedback: the dual averaging approach

Yuan Gao, Anton Rodomanov, Jeremy Rack, and Sebastian U Stich. Composite optimization with error feedback: the dual averaging approach. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=PSmakC4sw5

2026

[13] [13]

On proximal policy optimization’s heavy-tailed gradients

Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Ravikumar. On proximal policy optimization’s heavy-tailed gradients. InInternational Conference on Machine Learning, pages 3610–3619. PMLR, 2021

2021

[14] [14]

Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013

2013

[15] [15]

Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1): 267–305, 2016

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1): 267–305, 2016

2016

[16] [16]

Why random reshuffling beats stochastic gradient descent.Mathematical Programming, 186(1):49–84, 2021

Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent.Mathematical Programming, 186(1):49–84, 2021

2021

[17] [17]

Random shuffling beats sgd after finite epochs

Jeff Haochen and Suvrit Sra. Random shuffling beats sgd after finite epochs. InInternational Conference on Machine Learning, pages 2624–2633. PMLR, 2019

2019

[18] [18]

From gradient clipping to normalization for heavy tailed sgd

Florian Hübler, Ilyas Fatkhullin, and Niao He. From gradient clipping to normalization for heavy tailed sgd. InInternational Conference on Artificial Intelligence and Statistics, pages 2413–2421. PMLR, 2025

2025

[19] [19]

Proximal random reshuffling under local lipschitz continuity.arXiv preprint arXiv:2408.07182, 2024

Cedric Josz, Lexiao Lai, and Xiaopeng Li. Proximal random reshuffling under local lipschitz continuity.arXiv preprint arXiv:2408.07182, 2024

work page arXiv 2024

[20] [20]

Better theory for sgd in the nonconvex world.Transactions on Machine Learning Research, 2020

Ahmed Khaled and Peter Richtárik. Better theory for sgd in the nonconvex world.Transactions on Machine Learning Research, 2020

2020

[21] [21]

On convergence of incremental gradient for non-convex smooth functions

Anastasia Koloskova, Nikita Doikov, Sebastian U Stich, and Martin Jaggi. On convergence of incremental gradient for non-convex smooth functions. InInternational Conference on Machine Learning, pages 25058–25086. PMLR, 2024

2024

[22] [22]

On convergence of incremental gradient for non-convex smooth functions

Anastasia Koloskova, Nikita Doikov, Sebastian U Stich, and Martin Jaggi. On convergence of incremental gradient for non-convex smooth functions. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=ZRMQX6aTUS

2024

[23] [23]

Sparse convolutional neural networks

Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015

2015

[24] [24]

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

Zijian Liu. Clipped gradient methods for nonsmooth convex optimization under heavy-tailed noise: A refined analysis.arXiv preprint arXiv:2512.23178, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

On the last-iterate convergence of shuffling gradient methods

Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In International Conference on Machine Learning, pages 32471–32508. PMLR, 2024. 12

2024

[26] [26]

Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping

Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[27] [27]

Relatively smooth convex optimization by first-order methods, and applications.SIAM Journal on Optimization, 28(1):333–354, 2018

Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications.SIAM Journal on Optimization, 28(1):333–354, 2018

2018

[28] [28]

Russell Luke.Proximal Methods for Image Processing, pages 165–202

D. Russell Luke.Proximal Methods for Image Processing, pages 165–202. Springer International Publishing, Cham, 2020. ISBN 978-3-030-34413-9. doi: 10 .1007/978-3-030-34413-9_6. URL https://doi.org/10.1007/978-3-030-34413-9_6

work page doi:10.1007/978-3-030-34413-9_6 2020

[29] [29]

Random reshuffling: Simple analysis with vast improvements.Advances in Neural Information Processing Systems, 33:17309–17320, 2020

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements.Advances in Neural Information Processing Systems, 33:17309–17320, 2020

2020

[30] [30]

Proximal and federated random reshuffling

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. InInternational Conference on Machine Learning, pages 15718–15749. PMLR, 2022

2022

[31] [31]

Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! InInternational Conference on Machine Learning, pages 15750–15769

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! InInternational Conference on Machine Learning, pages 15750–15769. PMLR, 2022

2022

[32] [32]

Convergence rate of incremental subgradient algorithms

Angelia Nedic and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. Applied Optimization, 54:223–264, 2001

2001

[33] [33]

Problem complexity and method efficiency in optimization

Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983

1983

[34] [34]

A unified convergence analysis for shuffling-type gradient methods.Journal of Machine Learning Research, 22(207):1–44, 2021

Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods.Journal of Machine Learning Research, 22(207):1–44, 2021

2021

[35] [35]

A tail-index analysis of stochastic gradient noise in deep neural networks

Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pages 5827–5837. PMLR, 2019

2019

[36] [36]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance

Nuri Mert Vural, Lu Yu, Krishna Balasubramanian, Stanislav Volgushev, and Murat A Erdogdu. Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance. InConference on Learning Theory, pages 65–102. PMLR, 2022

2022

[38] [38]

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, and Lijun Zhang. Sign-based optimizers are effective under heavy-tailed noise.arXiv preprint arXiv:2602.07425, 2026. A Auxiliary facts and technical lemmas We first present a simple fact regarding the uniform convexity of the power function, which is a key technical tool in our analysis. 13 Fact 1.The function ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

Therefore: T−1X t=0 E[F(x t+1)−F ⋆] + γ 2 T−1X t=0 E h ∥xt+1 −x t∥2 i + λ p2p−2 T−1X t=0 E[∥x t+1 −x t∥p] ≤ γR2 0 2 + λRp 0 p + T−1X t=0 E[⟨g t − ∇f(x t),x t −x t+1⟩]. To upper bound the residual error, we use Young’s inequality: E[⟨g t − ∇f(x t),x t −x t+1⟩] =E[⟨∇f it(xt)− ∇f it(x⋆) +∇f(x ⋆)− ∇f(x t),x t −x t+1⟩] +E[⟨∇f it(x⋆)− ∇f(x ⋆),x t −x t+1⟩] (iv) ...