The Dual Averaging Power-Prox Method with Application to Heavy-Tail Incremental Gradient
Pith reviewed 2026-06-27 15:16 UTC · model grok-4.3
The pith
The dual averaging power-prox method achieves faster asymptotic convergence than i.i.d. SGD under cyclic incremental gradients with heavy-tailed noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumptions of cyclic (without-replacement) access to component gradients and bounded q-th centralized moments of those gradients at the optimum, the Dual Averaging Power-Prox method is the first algorithm shown to converge, and it does so at a strictly better asymptotic rate than the natural i.i.d. SGD counterpart.
What carries the argument
The Dual Averaging Power-Prox method, which performs dual averaging updates using a power-prox operator tailored to the heavy-tailed incremental setting.
If this is right
- Convergence guarantees now exist for a class of incremental methods that classical bounded-variance theory could not cover.
- The method improves the asymptotic rate relative to with-replacement SGD while retaining the same per-iteration cost.
- Fixed cyclic passes over the data become theoretically preferable to random reshuffling under the stated moment condition.
- The analysis supplies explicit dependence of the rate on the moment parameter q.
Where Pith is reading between the lines
- Cyclic ordering may become advantageous precisely when noise moments are limited, reversing the usual preference for randomization.
- The same dual-averaging construction could be tested on other structured sampling patterns such as mini-batch without replacement.
- Training loops that already traverse data in fixed order could adopt the power-prox update without changing the data pipeline.
Load-bearing premise
At the optimum the component gradients possess a finite q-th centralized moment for some q in (1,2].
What would settle it
An experiment in which the observed convergence rate of the power-prox method equals or falls below that of i.i.d. SGD on a problem whose component gradients have unbounded second moments at the optimum.
Figures
read the original abstract
We study finite-sum composite optimization under two departures from classical stochastic gradient descent theory that are central in practice: incremental gradient access and heavy-tailed gradient noise. Specifically, we consider fixed cyclic passes over component gradients and assume that, at the optimum, component gradients have a bounded $q$-th centralized moment for some $q\in(1,2]$. This setting is much closer to modern ML training practice than the assumptions used in classical SGD theory, yet its theoretical understanding remains limited. We propose a Dual Averaging Power-Prox method for incremental gradients and establish, to the best of our knowledge, the first convergence analysis in this regime. We further show that our method achieves a better asymptotic convergence rate than the corresponding SGD method with i.i.d. (with-replacement) sampling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Dual Averaging Power-Prox method for finite-sum composite optimization under cyclic incremental gradient access and heavy-tailed noise, where component gradients satisfy a bounded q-th centralized moment (q ∈ (1,2]) only at the optimum. It claims to deliver the first convergence analysis in this regime and a strictly better asymptotic rate than the corresponding i.i.d. SGD method.
Significance. If the analysis is complete, the result would supply the first rigorous rates for a practical incremental method under moment assumptions weaker than finite variance, directly relevant to modern ML training. The claimed rate improvement over with-replacement SGD would be a notable theoretical distinction.
major comments (1)
- [Abstract (and wherever the main assumption is formalized)] The assumption that component gradients have bounded q-th centralized moment is stated only at the optimum x*. For q ∈ (1,2] the noise has infinite variance; without an auxiliary argument showing that the iterates remain in a controlled neighborhood (or that the moment bound extends to a neighborhood of x*), large deviations can send the sequence outside any region where the bound applies. This directly affects both the rate derivation and the comparison to i.i.d. SGD. The abstract gives no indication that such control is established.
Simulated Author's Rebuttal
We appreciate the referee's careful reading of our manuscript and the constructive comment regarding the moment assumption. Below we provide our response and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract (and wherever the main assumption is formalized)] The assumption that component gradients have bounded q-th centralized moment is stated only at the optimum x*. For q ∈ (1,2] the noise has infinite variance; without an auxiliary argument showing that the iterates remain in a controlled neighborhood (or that the moment bound extends to a neighborhood of x*), large deviations can send the sequence outside any region where the bound applies. This directly affects both the rate derivation and the comparison to i.i.d. SGD. The abstract gives no indication that such control is established.
Authors: We agree that the q-th moment bound is stated only at x* and that the abstract provides no indication of iterate control. For q ∈ (1,2] this raises a legitimate technical issue, as the analysis requires the bound to apply along the trajectory. The manuscript's convergence proof (Theorem 3.1 and supporting lemmas) proceeds under the stated assumption at x*, but does not contain an explicit auxiliary argument establishing that the dual-averaging proximal updates keep iterates in a suitable neighborhood with high probability. We will therefore add this argument in the revision, update the abstract to note the established control, and insert a clarifying remark after the assumption statement. The same clarification will be applied to the i.i.d. SGD comparison. This constitutes a substantive but contained revision. revision: yes
Circularity Check
No circularity: derivation is self-contained from stated assumptions
full rationale
The paper states an assumption (bounded q-th centralized moment of component gradients at the optimum) and derives convergence rates for the proposed Dual Averaging Power-Prox method under incremental gradient access. The claimed better asymptotic rate versus i.i.d. SGD is presented as a direct consequence of the analysis under this assumption. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The derivation chain relies on standard optimization analysis techniques applied to the given moment condition and does not reduce to its inputs by construction. External benchmarks or machine-checked results are not needed here as the central claims remain independent of any circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption component gradients have a bounded q-th centralized moment for some q∈(1,2] at the optimum
Reference graph
Works this paper leans on
-
[1]
Towards Weaker Variance Assumptions for Stochastic Optimization
Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025
work page internal anchor Pith review arXiv 2025
-
[2]
Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Nonlinear stochastic gradient descent and heavy-tailed noise: A unified framework and high-probability guarantees.arXiv preprint arXiv:2410.13954, 2024
-
[3]
Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Optimal high- probability convergence of nonlinear sgd under heavy-tailed noise via symmetrization.arXiv preprint arXiv:2507.09093, 2025
-
[4]
Revisiting the noise model of stochastic gradient descent
Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. InInternational Conference on Artificial Intelligence and Statistics, pages 4780–4788. PMLR, 2024
2024
-
[5]
Practical recommendations for gradient-based training of deep architectures
Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second edition, pages 437–478. Springer, 2012
2012
-
[6]
arXiv preprint arXiv:2306.12498 , year=
Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Empirical risk minimization with shuffled sgd: A primal-dual perspective and improved bounds.arXiv preprint arXiv:2306.12498, 2023
-
[7]
Clipping improves adam-norm and adagrad-norm when the noise is heavy-tailed
Savelii Chezhegov, Klyukin Yaroslav, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, and Eduard Gorbunov. Clipping improves adam-norm and adagrad-norm when the noise is heavy-tailed. InInternational Conference on Machine Learning, pages 10269–10333. PMLR, 2025
2025
-
[8]
Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023
2023
-
[9]
Proximal splitting methods in signal processing
Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. InFixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011
2011
-
[10]
Can sgd handle heavy-tailed noise? InOPT 2025: Optimization for Machine Learning, 2025
Ilyas Fatkhullin, Florian Hübler, and Guanghui Lan. Can sgd handle heavy-tailed noise? InOPT 2025: Optimization for Machine Learning, 2025. 11
2025
-
[11]
Tight lower bounds and optimal algorithms for stochastic nonconvex optimization with heavy-tailed noise
Adrien Fradin, Abdurakhmon Sadiev, Laurent Condat, and Peter Richtárik. Tight lower bounds and optimal algorithms for stochastic nonconvex optimization with heavy-tailed noise. InThe 29th International Conference on Artificial Intelligence and Statistics, 2026
2026
-
[12]
Composite optimization with error feedback: the dual averaging approach
Yuan Gao, Anton Rodomanov, Jeremy Rack, and Sebastian U Stich. Composite optimization with error feedback: the dual averaging approach. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=PSmakC4sw5
2026
-
[13]
On proximal policy optimization’s heavy-tailed gradients
Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Ravikumar. On proximal policy optimization’s heavy-tailed gradients. InInternational Conference on Machine Learning, pages 3610–3619. PMLR, 2021
2021
-
[14]
Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013
Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013
2013
-
[15]
Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1): 267–305, 2016
Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1): 267–305, 2016
2016
-
[16]
Why random reshuffling beats stochastic gradient descent.Mathematical Programming, 186(1):49–84, 2021
Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent.Mathematical Programming, 186(1):49–84, 2021
2021
-
[17]
Random shuffling beats sgd after finite epochs
Jeff Haochen and Suvrit Sra. Random shuffling beats sgd after finite epochs. InInternational Conference on Machine Learning, pages 2624–2633. PMLR, 2019
2019
-
[18]
From gradient clipping to normalization for heavy tailed sgd
Florian Hübler, Ilyas Fatkhullin, and Niao He. From gradient clipping to normalization for heavy tailed sgd. InInternational Conference on Artificial Intelligence and Statistics, pages 2413–2421. PMLR, 2025
2025
-
[19]
Proximal random reshuffling under local lipschitz continuity.arXiv preprint arXiv:2408.07182, 2024
Cedric Josz, Lexiao Lai, and Xiaopeng Li. Proximal random reshuffling under local lipschitz continuity.arXiv preprint arXiv:2408.07182, 2024
-
[20]
Better theory for sgd in the nonconvex world.Transactions on Machine Learning Research, 2020
Ahmed Khaled and Peter Richtárik. Better theory for sgd in the nonconvex world.Transactions on Machine Learning Research, 2020
2020
-
[21]
On convergence of incremental gradient for non-convex smooth functions
Anastasia Koloskova, Nikita Doikov, Sebastian U Stich, and Martin Jaggi. On convergence of incremental gradient for non-convex smooth functions. InInternational Conference on Machine Learning, pages 25058–25086. PMLR, 2024
2024
-
[22]
On convergence of incremental gradient for non-convex smooth functions
Anastasia Koloskova, Nikita Doikov, Sebastian U Stich, and Martin Jaggi. On convergence of incremental gradient for non-convex smooth functions. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=ZRMQX6aTUS
2024
-
[23]
Sparse convolutional neural networks
Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015
2015
-
[24]
Zijian Liu. Clipped gradient methods for nonsmooth convex optimization under heavy-tailed noise: A refined analysis.arXiv preprint arXiv:2512.23178, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
On the last-iterate convergence of shuffling gradient methods
Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In International Conference on Machine Learning, pages 32471–32508. PMLR, 2024. 12
2024
-
[26]
Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping
Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[27]
Relatively smooth convex optimization by first-order methods, and applications.SIAM Journal on Optimization, 28(1):333–354, 2018
Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications.SIAM Journal on Optimization, 28(1):333–354, 2018
2018
-
[28]
Russell Luke.Proximal Methods for Image Processing, pages 165–202
D. Russell Luke.Proximal Methods for Image Processing, pages 165–202. Springer International Publishing, Cham, 2020. ISBN 978-3-030-34413-9. doi: 10 .1007/978-3-030-34413-9_6. URL https://doi.org/10.1007/978-3-030-34413-9_6
-
[29]
Random reshuffling: Simple analysis with vast improvements.Advances in Neural Information Processing Systems, 33:17309–17320, 2020
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements.Advances in Neural Information Processing Systems, 33:17309–17320, 2020
2020
-
[30]
Proximal and federated random reshuffling
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. InInternational Conference on Machine Learning, pages 15718–15749. PMLR, 2022
2022
-
[31]
Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! InInternational Conference on Machine Learning, pages 15750–15769
Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! InInternational Conference on Machine Learning, pages 15750–15769. PMLR, 2022
2022
-
[32]
Convergence rate of incremental subgradient algorithms
Angelia Nedic and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. Applied Optimization, 54:223–264, 2001
2001
-
[33]
Problem complexity and method efficiency in optimization
Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983
1983
-
[34]
A unified convergence analysis for shuffling-type gradient methods.Journal of Machine Learning Research, 22(207):1–44, 2021
Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods.Journal of Machine Learning Research, 22(207):1–44, 2021
2021
-
[35]
A tail-index analysis of stochastic gradient noise in deep neural networks
Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pages 5827–5837. PMLR, 2019
2019
-
[36]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance
Nuri Mert Vural, Lu Yu, Krishna Balasubramanian, Stanislav Volgushev, and Murat A Erdogdu. Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance. InConference on Learning Theory, pages 65–102. PMLR, 2022
2022
-
[38]
Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise
Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, and Lijun Zhang. Sign-based optimizers are effective under heavy-tailed noise.arXiv preprint arXiv:2602.07425, 2026. A Auxiliary facts and technical lemmas We first present a simple fact regarding the uniform convexity of the power function, which is a key technical tool in our analysis. 13 Fact 1.The function ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[39]
Therefore: T−1X t=0 E[F(x t+1)−F ⋆] + γ 2 T−1X t=0 E h ∥xt+1 −x t∥2 i + λ p2p−2 T−1X t=0 E[∥x t+1 −x t∥p] ≤ γR2 0 2 + λRp 0 p + T−1X t=0 E[⟨g t − ∇f(x t),x t −x t+1⟩]. To upper bound the residual error, we use Young’s inequality: E[⟨g t − ∇f(x t),x t −x t+1⟩] =E[⟨∇f it(xt)− ∇f it(x⋆) +∇f(x ⋆)− ∇f(x t),x t −x t+1⟩] +E[⟨∇f it(x⋆)− ∇f(x ⋆),x t −x t+1⟩] (iv) ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.