Unified High-Probability Analysis of Stochastic Variance-Reduced Estimation

Abolfazl Hashemi; Antesh Upadhyay; Anuran Makur; M. Berk Sahin; Sang Bin Moon; Zhankun Luo

arxiv: 2605.15388 · v1 · pith:3JOTCEYRnew · submitted 2026-05-14 · 💻 cs.LG

Unified High-Probability Analysis of Stochastic Variance-Reduced Estimation

Zhankun Luo , Antesh Upadhyay , M. Berk Sahin , Sang Bin Moon , Anuran Makur , Abolfazl Hashemi This is my paper

Pith reviewed 2026-05-19 16:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords stochastic optimizationvariance reductionhigh-probability boundsFreedman inequalitymirror descentexpectation constraintsoracle complexitymartingale concentration

0 comments

The pith

A unified recursion with memory, reset, and correction terms yields high-probability bounds for variance-reduced stochastic estimators in normed spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a recursive framework for stochastic variance-reduced estimation built from memory retention, a reset probability, and a correction term that accounts for changes in the iterate. This single structure recovers classical methods such as SPIDER, STORM, and PAGE, supports a bias-variance decomposition of estimation error, and motivates new second-order variants. The central result is a unified high-probability bound on the estimation error, obtained via a new dimension-free vector-valued Freedman inequality that applies to random sums of vector martingales in smooth normed spaces. The bound holds in both Euclidean and non-Euclidean geometries, including Banach spaces used by mirror descent. As a direct consequence the work establishes the first Õ(ε^{-3}) oracle-complexity guarantee for stochastic optimization problems that include expectation constraints, improving on the previous Õ(ε^{-4}) rate.

Core claim

The central claim is that the three-component recursive estimator satisfies a high-probability bound on its estimation error, proved using a newly introduced dimension-free vector-valued Freedman inequality for random sums of vector martingales in smooth normed spaces; this bound applies equally in Euclidean and Banach settings and directly produces improved high-probability oracle complexities for both unconstrained mirror-descent optimization and stochastic optimization with expectation constraints.

What carries the argument

The recursion defined by memory retention, reset probability, and iterate-movement correction, together with the dimension-free vector-valued Freedman inequality for vector martingale sums.

If this is right

The framework recovers classical estimators including momentum, SPIDER, STORM, and PAGE.
It motivates new second-order variants of variance-reduced estimators.
High-probability oracle complexities are obtained for unconstrained mirror-descent optimization with only logarithmic dependence on the confidence level.
The first Õ(ε^{-3}) oracle-complexity bounds are derived for stochastic optimization with expectation constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dimension-free character of the inequality suggests the bounds remain useful even when dimension grows without introducing extra logarithmic factors.
The same recursive structure and inequality may be applied to derive high-probability guarantees for other families of stochastic algorithms beyond variance reduction.
The extension to Banach spaces indicates possible use in optimization problems whose geometry is not captured by Euclidean or Hilbert-space norms.

Load-bearing premise

The newly introduced dimension-free vector-valued Freedman inequality holds for random sums of vector martingales in smooth normed spaces, including Banach spaces.

What would settle it

A concrete counterexample consisting of a vector martingale sequence in a smooth normed space for which the claimed high-probability bound on the sum is violated would disprove the central result.

read the original abstract

Stochastic estimators are fundamental to large-scale optimization, where population quantities must be inferred from noisy oracle observations. Although influential methods such as momentum, SPIDER, STORM, and PAGE have been highly successful, their analyses are largely estimator-specific and expectation-based, obscuring the structural tradeoffs that determine reliability. In this paper, we develop a unified framework for stochastic variance-reduced estimation based on a recursion with three components: memory retention, reset probability, and a correction term for iterate movement. This framework recovers several classical estimators, motivates new second-order variants, and yields a bias-variance decomposition of estimation error. Our main result is a unified high-probability bound proved using a new dimension-free vector-valued Freedman inequality, valid for smooth normed spaces involving random sums of vector martingales. The result applies in both Euclidean and non-Euclidean settings, including the analysis of mirror-descent-based methods in Banach spaces. As applications, we obtain high-probability oracle complexities for unconstrained optimization with mirror descent, establishing the logarithmic dependence on the confidence level. We also derive the first $\tilde{\mathcal{O}}(\varepsilon^{-3})$ oracle-complexity bounds for stochastic optimization with expectation constraints, improving upon the existing $\tilde{\mathcal{O}}(\varepsilon^{-4})$ complexity by leveraging variance-reduced estimation for the first time in this setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper unifies several variance-reduced estimators under one recursion and claims a new dimension-free vector-valued Freedman inequality that yields the first Õ(ε^{-3}) high-probability complexity for stochastic optimization with expectation constraints.

read the letter

The main point is a unified recursion with memory retention, reset probability, and a correction term that recovers momentum, SPIDER, STORM, and PAGE while also suggesting second-order variants. From this they extract a bias-variance decomposition and then prove a single high-probability bound that works in smooth normed spaces, including Banach spaces for mirror descent. The concrete payoff is the claimed first Õ(ε^{-3}) oracle complexity for expectation-constrained stochastic optimization, improving on the prior Õ(ε^{-4}) result, plus logarithmic dependence on the failure probability for the unconstrained mirror-descent case. That is the part worth noting: one framework, one inequality, and a measurable complexity gain in a setting where variance reduction had not been applied before. The framework itself looks clean and avoids the usual estimator-by-estimator proofs, which is useful for anyone who wants to try new variants without re-deriving everything. The soft spot is exactly the new inequality. The entire high-probability story rests on it being dimension-free and free of extra factors from the modulus of smoothness or the martingale structure. If the proof introduces hidden dependence on dimension or requires stronger assumptions than stated, the Õ(ε^{-3}) claim and the non-Euclidean extension both weaken. The abstract does not show the derivation, so that is the section that needs the closest read. The citation pattern is straightforward and the work is self-contained on its own terms. This is for people working on high-probability analysis of stochastic methods or on non-Euclidean optimization. It is not a routine extension, so it deserves a serious referee to check the inequality and the tightness of the resulting constants.

Referee Report

2 major / 2 minor

Summary. The paper develops a unified recursion framework for stochastic variance-reduced estimation consisting of memory retention, reset probability, and a correction term for iterate movement. This recovers classical estimators (momentum, SPIDER, STORM, PAGE), motivates new second-order variants, and yields a bias-variance decomposition. The central result is a unified high-probability bound obtained via a new dimension-free vector-valued Freedman inequality for random sums of vector martingales in smooth normed spaces (including Banach spaces). Applications give high-probability oracle complexities for mirror-descent unconstrained optimization (with logarithmic dependence on confidence level) and the first Õ(ε^{-3}) oracle complexity for stochastic optimization with expectation constraints, improving on prior Õ(ε^{-4}) bounds.

Significance. If the new vector-valued Freedman inequality holds with the claimed dimension-free property in smooth normed spaces, the work provides a valuable structural unification of high-probability analyses for variance-reduced methods, extending beyond Euclidean settings to mirror descent in Banach spaces. The bias-variance decomposition and recovery of multiple estimators clarify trade-offs, while the improved complexity for expectation-constrained problems is a concrete advance. The framework's generality could influence subsequent analyses in stochastic optimization.

major comments (2)

[§4] §4 (Theorem 4.1, the dimension-free vector-valued Freedman inequality): the proof must explicitly confirm that the tail bound remains free of hidden dependence on the modulus of smoothness or separability assumptions when applied to the non-Euclidean norms arising in mirror descent; otherwise the downstream Õ(ε^{-3}) claim for expectation-constrained optimization (Theorem 5.4) cannot be guaranteed without additional regularity not controlled in the final expression.
[§5.3] §5.3 (oracle complexity for expectation-constrained problems): the reduction from the unified high-probability bound to the Õ(ε^{-3}) rate assumes the martingale increments satisfy the exact conditions of the new inequality without extra logarithmic factors from the constraint functions; this step should be expanded with an explicit error propagation argument to substantiate the improvement over prior Õ(ε^{-4}) results.

minor comments (2)

[§3] Notation for the reset probability and correction term in the unified recursion (Definition 3.1) should be aligned with the bias-variance decomposition in Lemma 3.2 to avoid ambiguity when recovering SPIDER and STORM.
[§6] Figure 1 (comparison of estimators) lacks error bars or explicit post-hoc selection details; adding these would strengthen the empirical support for the high-probability claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and valuable feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. We will make the suggested clarifications in the revised version.

read point-by-point responses

Referee: [§4] §4 (Theorem 4.1, the dimension-free vector-valued Freedman inequality): the proof must explicitly confirm that the tail bound remains free of hidden dependence on the modulus of smoothness or separability assumptions when applied to the non-Euclidean norms arising in mirror descent; otherwise the downstream Õ(ε^{-3}) claim for expectation-constrained optimization (Theorem 5.4) cannot be guaranteed without additional regularity not controlled in the final expression.

Authors: We thank the referee for highlighting this important aspect of the proof. The vector-valued Freedman inequality in Theorem 4.1 is derived in a manner that ensures the tail bound is dimension-free and does not depend on the modulus of smoothness or separability assumptions beyond the standard smoothness of the norm, which is already part of the setup for mirror descent in Banach spaces. The constants in the bound are uniform and independent of these parameters in the final expression. To address the concern explicitly, we will revise the proof in the appendix to include a dedicated paragraph confirming this independence and its implications for the mirror descent application. This will also strengthen the guarantee for the complexity in Theorem 5.4. revision: yes
Referee: [§5.3] §5.3 (oracle complexity for expectation-constrained problems): the reduction from the unified high-probability bound to the Õ(ε^{-3}) rate assumes the martingale increments satisfy the exact conditions of the new inequality without extra logarithmic factors from the constraint functions; this step should be expanded with an explicit error propagation argument to substantiate the improvement over prior Õ(ε^{-4}) results.

Authors: We agree that an explicit error propagation argument would improve the clarity of the reduction in Section 5.3. In the revised manuscript, we will add a detailed subsection or paragraph that propagates the errors from the constraint functions through the high-probability bound, demonstrating that no additional logarithmic factors are introduced beyond those already present in the unified bound. This will rigorously justify the Õ(ε^{-3}) oracle complexity and highlight the improvement over previous Õ(ε^{-4}) results. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; new inequality introduced independently

full rationale

The paper introduces a recursion for variance-reduced estimators and derives a unified high-probability bound by applying a newly stated dimension-free vector-valued Freedman inequality for martingales in smooth normed spaces. This inequality is positioned as an independent technical contribution that enables the bound and downstream oracle complexities (including the Õ(ε^{-3}) result). No quoted equations or steps reduce the main result to a fitted parameter, a self-citation chain, or a definitional renaming; the framework recovers known estimators via the recursion but does not force the high-probability claim by construction. The derivation is therefore self-contained against external benchmarks once the new inequality is accepted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility; the framework relies on standard smoothness and normed-space assumptions plus the new inequality as a mathematical tool.

axioms (1)

domain assumption Smoothness assumptions in normed spaces for the vector martingale analysis
Invoked to ensure the new Freedman inequality applies to both Euclidean and Banach-space mirror-descent settings.

pith-pipeline@v0.9.0 · 5784 in / 1227 out tokens · 44254 ms · 2026-05-19T16:34:21.800788+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

unified high-probability bound proved using a new dimension-free vector-valued Freedman inequality, valid for smooth normed spaces involving random sums of vector martingales
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

three components: memory retention, reset probability, and a correction term

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

155 extracted references · 155 canonical work pages · 6 internal anchors

[1]

Global update tracking: A decentralized learning algorithm for heterogeneous data.Advances in neural information processing systems, 36:48939–48961, 2023

Sai Aparna Aketi, Abolfazl Hashemi, and Kaushik Roy. Global update tracking: A decentralized learning algorithm for heterogeneous data.Advances in neural information processing systems, 36:48939–48961, 2023

work page 2023
[2]

Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

work page arXiv 2025
[3]

On modification of an adaptive stochastic mirror descent algorithm for convex opti- mization problems with functional constraints

Mohammad S Alkousa. On modification of an adaptive stochastic mirror descent algorithm for convex opti- mization problems with functional constraints. InComputational Mathematics and Applications, pages 47–63. Springer, 2020

work page 2020
[4]

Neon2: Finding local minima via first-order oracles

Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. InAdvances in Neural Information Processing Systems, pages 3716–3726, 2018

work page 2018
[5]

Springer, Cham, Switzerland, second edition, 2021

Luigi Ambrosio, Elia Brué, and Daniele Semola.Lectures on optimal transport. Springer, Cham, Switzerland, second edition, 2021

work page 2021
[6]

Second-order information in non-convex stochastic optimization: Power and limitations

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. InConference on Learning Theory, pages 242–299. PMLR, 2020

work page 2020
[7]

Lower bounds for non-convex stochastic optimization.Mathematical Programming, pages 1–50, 2022

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, pages 1–50, 2022

work page 2022
[8]

Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity.SIAM Journal on Optimization, 29(3):2257–2290, 2019

Hilal Asi and John C Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity.SIAM Journal on Optimization, 29(3):2257–2290, 2019

work page 2019
[9]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems, volume 24, 2011

work page 2011
[10]

Federated learning review: Fundamentals, enabling technologies, and future applications.Information processing & management, 59(6):103061, 2022

Syreen Banabilah, Moayad Aloqaily, Eitaa Alsayed, Nida Malik, and Yaser Jararweh. Federated learning review: Fundamentals, enabling technologies, and future applications.Information processing & management, 59(6):103061, 2022

work page 2022
[11]

Springer Science & Business Media, 2012

Viorel Barbu and Teodor Precupanu.Convexity and optimization in Banach spaces. Springer Science & Business Media, 2012

work page 2012
[12]

Bregman monotone optimization algorithms

Heinz H Bauschke, Jonathan M Borwein, and Patrick L Combettes. Bregman monotone optimization algorithms. SIAM Journal on control and optimization, 42(2):596–636, 2003

work page 2003
[13]

Mirror descent and convex optimization problems with non-smooth inequality constraints

Anastasia Bayandina, Pavel Dvurechensky, Alexander Gasnikov, Fedor Stonyakin, and Alexander Titov. Mirror descent and convex optimization problems with non-smooth inequality constraints. InLarge-scale and distributed optimization, pages 181–213. Springer, 2018

work page 2018
[14]

Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

work page 2003
[15]

Academic press, 2014

Dimitri P Bertsekas.Constrained optimization and Lagrange multiplier methods. Academic press, 2014

work page 2014
[16]

Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

Julius R Blum. Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

work page 1954
[17]

Optimal primal-dual algorithm with last iterate convergence guarantees for stochastic convex optimization problems.arXiv preprint arXiv:2410.18513, 2024

Digvijay Boob and Mohammad Khalafi. Optimal primal-dual algorithm with last iterate convergence guarantees for stochastic convex optimization problems.arXiv preprint arXiv:2410.18513, 2024

work page arXiv 2024
[18]

Variance reduction.Wiley statsRef: Statistics reference online, pages 1–6, 2017

Zdravko Botev and Ad Ridder. Variance reduction.Wiley statsRef: Statistics reference online, pages 1–6, 2017

work page 2017
[19]

Optimization methods for large-scale learning.SIAM Review, 60(2):223–311, 2018

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale learning.SIAM Review, 60(2):223–311, 2018. 10

work page 2018
[20]

Oxford University Press, Oxford, UK, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, UK, 2013

work page 2013
[21]

Springer, 2011

Haim Brezis and Haim Brézis.Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011

work page 2011
[22]

A proximal point method for the variational inequality problem in banach spaces.SIAM Journal on Control and Optimization, 39(5):1633–1649, 2000

Regina Sandra Burachik and Susana Scheimberg. A proximal point method for the variational inequality problem in banach spaces.SIAM Journal on Control and Optimization, 39(5):1633–1649, 2000

work page 2000
[23]

Making sgd parameter-free

Yair Carmon and Oliver Hinder. Making sgd parameter-free. InConference on Learning Theory, pages 2360–2389. PMLR, 2022

work page 2022
[24]

A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of mathematical imaging and vision, 40:120–145, 2011

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of mathematical imaging and vision, 40:120–145, 2011

work page 2011
[25]

Federated learning empowered recommendation model for financial consumer services.IEEE Transactions on Consumer Electronics, 70(1):2508–2516, 2023

Pushpita Chatterjee, Debashis Das, and Danda B Rawat. Federated learning empowered recommendation model for financial consumer services.IEEE Transactions on Consumer Electronics, 70(1):2508–2516, 2023

work page 2023
[26]

Fednmut–federated noisy model update tracking convergence analysis.arXiv preprint arXiv:2403.13247, 2024

Vishnu Pandi Chellapandi, Antesh Upadhyay, Abolfazl Hashemi, and Stanislaw H ˙Zak. Fednmut–federated noisy model update tracking convergence analysis.arXiv preprint arXiv:2403.13247, 2024

work page arXiv 2024
[27]

Communication-efficient variance-reduced decentralized stochastic optimization over time-varying directed graphs.IEEE Transactions on Automatic Control, 2021

Yiyue Chen, Abolfazl Hashemi, and Haris Vikalo. Communication-efficient variance-reduced decentralized stochastic optimization over time-varying directed graphs.IEEE Transactions on Automatic Control, 2021

work page 2021
[28]

Cosson, A

R. Cosson, A. Jadbabaie, A. Makur, A. Reisizadeh, and D. Shah. Gradient descent with low-rank objective functions. InProceedings of the 62nd IEEE Conference on Decision and Control (CDC), pages 3309–3314, Singapore, December 13–15 2023

work page 2023
[29]

Cosson, A

R. Cosson, A. Jadbabaie, A. Makur, A. Reisizadeh, and D. Shah. Low-rank gradient descent.IEEE Open Journal of Control Systems, 2:380–395, October 2023

work page 2023
[30]

Shisheng Cui and Uday V Shanbhag. On the analysis of variance-reduced and randomized projection variants of single projection schemes for monotone stochastic variational inequality problems.Set-Valued and Variational Analysis, 29:453–499, 2021

work page 2021
[31]

Momentum improves normalized sgd

Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020

work page 2020
[32]

High-probability bounds for non-convex stochastic optimization with heavy tails.Advances in Neural Information Processing Systems, 34:4883–4895, 2021

Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails.Advances in Neural Information Processing Systems, 34:4883–4895, 2021

work page 2021
[33]

Momentum-based variance reduction in non-convex sgd.Advances in neural information processing systems, 32, 2019

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd.Advances in neural information processing systems, 32, 2019

work page 2019
[34]

Faster non-convex federated learning via global and local momentum

Rudrajit Das, Anish Acharya, Abolfazl Hashemi, Sujay Sanghavi, Inderjit S Dhillon, and Ufuk Topcu. Faster non-convex federated learning via global and local momentum. InUncertainty in Artificial Intelligence, pages 496–506. PMLR, 2022

work page 2022
[35]

Stochastic model-based minimization of weakly convex functions

Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

work page 2019
[36]

Saga: A fast incremental gradient method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

work page 2014
[37]

Learning-rate-free learning by d-adaptation

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. InInternational Conference on Machine Learning, pages 7449–7479. PMLR, 2023

work page 2023
[38]

Provable convergence guarantees for black-box variational inference

Justin Domke, Robert Gower, and Guillaume Garrigos. Provable convergence guarantees for black-box variational inference. InAdvances in Neural Information Processing Systems, volume 36, pages 66289–66327, 2023. 11

work page 2023
[39]

The complexity of finding stationary points with stochastic gradient descent.arXiv preprint arXiv:1910.01845, 2019

Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent.arXiv preprint arXiv:1910.01845, 2019

work page arXiv 1910
[40]

Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

work page 2019
[41]

Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

work page 2011
[42]

SIAM, 1999

Ivar Ekeland and Roger Temam.Convex analysis and variational problems. SIAM, 1999

work page 1999
[43]

Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator.Advances in neural information processing systems, 31, 2018

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator.Advances in neural information processing systems, 31, 2018

work page 2018
[44]

Sharp analysis for nonconvex SGD escaping from saddle points

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex SGD escaping from saddle points. In Conference on Learning Theory, pages 1192–1234. PMLR, 2019

work page 2019
[45]

Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance

Arda Fazla, Ege C. Kaya, Antesh Upadhyay, and Abolfazl Hashemi. Lower bounds and proximally anchored SGD for non-convex minimization under unbounded variance.arXiv preprint arXiv:2604.16620, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Stochastic first-and zeroth-order methods for nonconvex stochastic program- ming.SIAM journal on optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic program- ming.SIAM journal on optimization, 23(4):2341–2368, 2013

work page 2013
[47]

Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

work page 2016
[48]

On stochastic approximation.Theory of Probability & Its Applications, 10(2):275–278, 1965

EG Gladyshev. On stochastic approximation.Theory of Probability & Its Applications, 10(2):275–278, 1965

work page 1965
[49]

High- probability complexity bounds for non-smooth stochastic convex optimization with heavy-tailed noise.Journal of Optimization Theory and Applications, 203(3):2679–2738, 2024

Eduard Gorbunov, Marina Danilova, Innokentiy Shibaev, Pavel Dvurechensky, and Alexander Gasnikov. High- probability complexity bounds for non-smooth stochastic convex optimization with heavy-tailed noise.Journal of Optimization Theory and Applications, 203(3):2679–2738, 2024

work page 2024
[50]

A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. InInternational Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020

work page 2020
[51]

High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise.arXiv preprint arXiv:2310.01860, 2023

Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise.arXiv preprint arXiv:2310.01860, 2023

work page arXiv 2023
[52]

Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

Benjamin Grimmer. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

work page 2019
[53]

Disa: A dual inexact splitting algorithm for distributed convex composite optimization.IEEE Transactions on Automatic Control, 69(5):2995–3010, 2023

Luyao Guo, Xinli Shi, Shaofu Yang, and Jinde Cao. Disa: A dual inexact splitting algorithm for distributed convex composite optimization.IEEE Transactions on Automatic Control, 69(5):2995–3010, 2023

work page 2023
[54]

The heavy-tail phenomenon in SGD

Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. InInternational Conference on Machine Learning, pages 3964–3975. PMLR, 2021

work page 2021
[55]

Comparison of optimization techniques based on gradient descent algorithm: A review.PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(4):2715–2743, 2021

Saad Hikmat Haji and Adnan Mohsin Abdulazeez. Comparison of optimization techniques based on gradient descent algorithm: A review.PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(4):2715–2743, 2021

work page 2021
[56]

A primal-dual algorithm with line search for general convex-concave saddle point problems.SIAM Journal on Optimization, 31(2):1299–1329, 2021

Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm with line search for general convex-concave saddle point problems.SIAM Journal on Optimization, 31(2):1299–1329, 2021

work page 2021
[57]

J. M. Hammersley and J. G. Mauldon. General principles of antithetic variates.Mathematical Proceedings of the Cambridge Philosophical Society, 52(3):476–481, July 1956

work page 1956
[58]

J. M. Hammersley and K. W. Morton. A new Monte Carlo technique: Antithetic variates.Mathematical Proceedings of the Cambridge Philosophical Society, 52(3):449–475, July 1956. 12

work page 1956
[59]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InConference on Learning Theory, pages 1579–1613. PMLR, 2019

work page 2019
[60]

Hashemi, D

A. Hashemi, D. Lee, and A. Makur. Strong antithetic variates: Theory and applications. Preprint, pp. 1–13, 2025

work page 2025
[61]

A unified model for large-scale inexact fixed-point iteration: A stochastic optimization perspective.IEEE Transactions on Automatic Control, 70(4):2435–2449, 2024

Abolfazl Hashemi. A unified model for large-scale inexact fixed-point iteration: A stochastic optimization perspective.IEEE Transactions on Automatic Control, 70(4):2435–2449, 2024

work page 2024
[62]

RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation

Abolfazl Hashemi. Rampage: Randomized mid-point for debiased gradient extrapolation.arXiv preprint arXiv:2603.22155, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[63]

Strong antithetic variance reduction inequalities

Abolfazl Hashemi, Dogmin Lee, and Anuran Makur. Strong antithetic variance reduction inequalities. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pages 1–6, Ann Arbor, MI, USA, June 22–27 2025

work page 2025
[64]

Resilient constrained learning.Advances in Neural Information Processing Systems, 36:71767–71798, 2023

Ignacio Hounie, Alejandro Ribeiro, and Luiz FO Chamon. Resilient constrained learning.Advances in Neural Information Processing Systems, 36:71767–71798, 2023

work page 2023
[65]

Yankun Huang and Qihang Lin. Oracle complexity of single-loop switching subgradient methods for non- smooth weakly convex functional constrained optimization.Advances in Neural Information Processing Systems, 36:61327–61340, 2023

work page 2023
[66]

Distributed proximal algorithms for nonsmooth optimiza- tion: Unified convergence analysis.IEEE Transactions on Automatic Control, 2025

Yi Huang, Shisheng Cui, Jian Sun, and Ziyang Meng. Distributed proximal algorithms for nonsmooth optimiza- tion: Unified convergence analysis.IEEE Transactions on Automatic Control, 2025

work page 2025
[67]

Distributed multiproximal algorithm for nonsmooth convex optimization with coupled inequality constraints.IEEE Transactions on Automatic Control, 68(12):8126–8133, 2023

Yi Huang, Ziyang Meng, Jian Sun, and Wei Ren. Distributed multiproximal algorithm for nonsmooth convex optimization with coupled inequality constraints.IEEE Transactions on Automatic Control, 68(12):8126–8133, 2023

work page 2023
[68]

Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, and Tianjiao Li. Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

work page arXiv 2023
[69]

Safe-ef: Error feedback for non-smooth constrained optimiza- tion

Rustem Islamov, Yarden As, and Ilyas Fatkhullin. Safe-ef: Error feedback for non-smooth constrained optimiza- tion. InForty-second International Conference on Machine Learning, 2025

work page 2025
[70]

Safe-EF: Error feedback for non-smooth constrained optimiza- tion

Rustem Islamov, Yarden As, and Ilyas Fatkhullin. Safe-EF: Error feedback for non-smooth constrained optimiza- tion. InForty-second International Conference on Machine Learning, 2025

work page 2025
[71]

Unconstrained online learning with unbounded losses

Andrew Jacobsen and Ashok Cutkosky. Unconstrained online learning with unbounded losses. InInternational Conference on Machine Learning, pages 14590–14630. PMLR, 2023

work page 2023
[72]

Jadbabaie, A

A. Jadbabaie, A. Makur, and A. Reisizadeh. Adaptive low-rank gradient descent. InProceedings of the 62nd IEEE Conference on Decision and Control (CDC), pages 3315–3320, Singapore, December 13–15 2023

work page 2023
[73]

Jadbabaie, A

A. Jadbabaie, A. Makur, and D. Shah. Federated optimization of smooth loss functions.IEEE Transactions on Information Theory, 69(12):7836–7866, December 2023

work page 2023
[74]

Jadbabaie, A

A. Jadbabaie, A. Makur, and D. Shah. Gradient-based empirical risk minimization using local polynomial regression.Stochastic Systems, 14(4):363–402, December 2024

work page 2024
[75]

Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification.Journal of machine learning research, 18(223):1–42, 2018

work page 2018
[76]

First-order methods for nonsmooth nonconvex functional constrained optimization with or without slater points.SIAM Journal on Optimization, 35(2):1300–1329, 2025

Zhichao Jia and Benjamin Grimmer. First-order methods for nonsmooth nonconvex functional constrained optimization with or without slater points.SIAM Journal on Optimization, 35(2):1300–1329, 2025

work page 2025
[77]

Accelerating stochastic gradient descent using predictive variance reduction

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013. 13

work page 2013
[78]

Large deviations of vector-valued martingales in 2-smooth normed spaces.arXiv preprint arXiv:0809.0813, 2008

Anatoli Juditsky and Arkadii S Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces.arXiv preprint arXiv:0809.0813, 2008

work page arXiv 2008
[79]

Kahn and A

H. Kahn and A. W. Marshall. Methods of reducing sample size in Monte Carlo computations.Journal of the Operations Research Society of America, 1(5):263–278, November 1953

work page 1953
[80]

Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

work page 2023

Showing first 80 references.

[1] [1]

Global update tracking: A decentralized learning algorithm for heterogeneous data.Advances in neural information processing systems, 36:48939–48961, 2023

Sai Aparna Aketi, Abolfazl Hashemi, and Kaushik Roy. Global update tracking: A decentralized learning algorithm for heterogeneous data.Advances in neural information processing systems, 36:48939–48961, 2023

work page 2023

[2] [2]

Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

work page arXiv 2025

[3] [3]

On modification of an adaptive stochastic mirror descent algorithm for convex opti- mization problems with functional constraints

Mohammad S Alkousa. On modification of an adaptive stochastic mirror descent algorithm for convex opti- mization problems with functional constraints. InComputational Mathematics and Applications, pages 47–63. Springer, 2020

work page 2020

[4] [4]

Neon2: Finding local minima via first-order oracles

Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. InAdvances in Neural Information Processing Systems, pages 3716–3726, 2018

work page 2018

[5] [5]

Springer, Cham, Switzerland, second edition, 2021

Luigi Ambrosio, Elia Brué, and Daniele Semola.Lectures on optimal transport. Springer, Cham, Switzerland, second edition, 2021

work page 2021

[6] [6]

Second-order information in non-convex stochastic optimization: Power and limitations

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. InConference on Learning Theory, pages 242–299. PMLR, 2020

work page 2020

[7] [7]

Lower bounds for non-convex stochastic optimization.Mathematical Programming, pages 1–50, 2022

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, pages 1–50, 2022

work page 2022

[8] [8]

Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity.SIAM Journal on Optimization, 29(3):2257–2290, 2019

Hilal Asi and John C Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity.SIAM Journal on Optimization, 29(3):2257–2290, 2019

work page 2019

[9] [9]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems, volume 24, 2011

work page 2011

[10] [10]

Federated learning review: Fundamentals, enabling technologies, and future applications.Information processing & management, 59(6):103061, 2022

Syreen Banabilah, Moayad Aloqaily, Eitaa Alsayed, Nida Malik, and Yaser Jararweh. Federated learning review: Fundamentals, enabling technologies, and future applications.Information processing & management, 59(6):103061, 2022

work page 2022

[11] [11]

Springer Science & Business Media, 2012

Viorel Barbu and Teodor Precupanu.Convexity and optimization in Banach spaces. Springer Science & Business Media, 2012

work page 2012

[12] [12]

Bregman monotone optimization algorithms

Heinz H Bauschke, Jonathan M Borwein, and Patrick L Combettes. Bregman monotone optimization algorithms. SIAM Journal on control and optimization, 42(2):596–636, 2003

work page 2003

[13] [13]

Mirror descent and convex optimization problems with non-smooth inequality constraints

Anastasia Bayandina, Pavel Dvurechensky, Alexander Gasnikov, Fedor Stonyakin, and Alexander Titov. Mirror descent and convex optimization problems with non-smooth inequality constraints. InLarge-scale and distributed optimization, pages 181–213. Springer, 2018

work page 2018

[14] [14]

Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

work page 2003

[15] [15]

Academic press, 2014

Dimitri P Bertsekas.Constrained optimization and Lagrange multiplier methods. Academic press, 2014

work page 2014

[16] [16]

Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

Julius R Blum. Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

work page 1954

[17] [17]

Optimal primal-dual algorithm with last iterate convergence guarantees for stochastic convex optimization problems.arXiv preprint arXiv:2410.18513, 2024

Digvijay Boob and Mohammad Khalafi. Optimal primal-dual algorithm with last iterate convergence guarantees for stochastic convex optimization problems.arXiv preprint arXiv:2410.18513, 2024

work page arXiv 2024

[18] [18]

Variance reduction.Wiley statsRef: Statistics reference online, pages 1–6, 2017

Zdravko Botev and Ad Ridder. Variance reduction.Wiley statsRef: Statistics reference online, pages 1–6, 2017

work page 2017

[19] [19]

Optimization methods for large-scale learning.SIAM Review, 60(2):223–311, 2018

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale learning.SIAM Review, 60(2):223–311, 2018. 10

work page 2018

[20] [20]

Oxford University Press, Oxford, UK, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, UK, 2013

work page 2013

[21] [21]

Springer, 2011

Haim Brezis and Haim Brézis.Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011

work page 2011

[22] [22]

A proximal point method for the variational inequality problem in banach spaces.SIAM Journal on Control and Optimization, 39(5):1633–1649, 2000

Regina Sandra Burachik and Susana Scheimberg. A proximal point method for the variational inequality problem in banach spaces.SIAM Journal on Control and Optimization, 39(5):1633–1649, 2000

work page 2000

[23] [23]

Making sgd parameter-free

Yair Carmon and Oliver Hinder. Making sgd parameter-free. InConference on Learning Theory, pages 2360–2389. PMLR, 2022

work page 2022

[24] [24]

A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of mathematical imaging and vision, 40:120–145, 2011

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of mathematical imaging and vision, 40:120–145, 2011

work page 2011

[25] [25]

Federated learning empowered recommendation model for financial consumer services.IEEE Transactions on Consumer Electronics, 70(1):2508–2516, 2023

Pushpita Chatterjee, Debashis Das, and Danda B Rawat. Federated learning empowered recommendation model for financial consumer services.IEEE Transactions on Consumer Electronics, 70(1):2508–2516, 2023

work page 2023

[26] [26]

Fednmut–federated noisy model update tracking convergence analysis.arXiv preprint arXiv:2403.13247, 2024

Vishnu Pandi Chellapandi, Antesh Upadhyay, Abolfazl Hashemi, and Stanislaw H ˙Zak. Fednmut–federated noisy model update tracking convergence analysis.arXiv preprint arXiv:2403.13247, 2024

work page arXiv 2024

[27] [27]

Communication-efficient variance-reduced decentralized stochastic optimization over time-varying directed graphs.IEEE Transactions on Automatic Control, 2021

Yiyue Chen, Abolfazl Hashemi, and Haris Vikalo. Communication-efficient variance-reduced decentralized stochastic optimization over time-varying directed graphs.IEEE Transactions on Automatic Control, 2021

work page 2021

[28] [28]

Cosson, A

R. Cosson, A. Jadbabaie, A. Makur, A. Reisizadeh, and D. Shah. Gradient descent with low-rank objective functions. InProceedings of the 62nd IEEE Conference on Decision and Control (CDC), pages 3309–3314, Singapore, December 13–15 2023

work page 2023

[29] [29]

Cosson, A

R. Cosson, A. Jadbabaie, A. Makur, A. Reisizadeh, and D. Shah. Low-rank gradient descent.IEEE Open Journal of Control Systems, 2:380–395, October 2023

work page 2023

[30] [30]

Shisheng Cui and Uday V Shanbhag. On the analysis of variance-reduced and randomized projection variants of single projection schemes for monotone stochastic variational inequality problems.Set-Valued and Variational Analysis, 29:453–499, 2021

work page 2021

[31] [31]

Momentum improves normalized sgd

Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020

work page 2020

[32] [32]

High-probability bounds for non-convex stochastic optimization with heavy tails.Advances in Neural Information Processing Systems, 34:4883–4895, 2021

Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails.Advances in Neural Information Processing Systems, 34:4883–4895, 2021

work page 2021

[33] [33]

Momentum-based variance reduction in non-convex sgd.Advances in neural information processing systems, 32, 2019

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd.Advances in neural information processing systems, 32, 2019

work page 2019

[34] [34]

Faster non-convex federated learning via global and local momentum

Rudrajit Das, Anish Acharya, Abolfazl Hashemi, Sujay Sanghavi, Inderjit S Dhillon, and Ufuk Topcu. Faster non-convex federated learning via global and local momentum. InUncertainty in Artificial Intelligence, pages 496–506. PMLR, 2022

work page 2022

[35] [35]

Stochastic model-based minimization of weakly convex functions

Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

work page 2019

[36] [36]

Saga: A fast incremental gradient method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

work page 2014

[37] [37]

Learning-rate-free learning by d-adaptation

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. InInternational Conference on Machine Learning, pages 7449–7479. PMLR, 2023

work page 2023

[38] [38]

Provable convergence guarantees for black-box variational inference

Justin Domke, Robert Gower, and Guillaume Garrigos. Provable convergence guarantees for black-box variational inference. InAdvances in Neural Information Processing Systems, volume 36, pages 66289–66327, 2023. 11

work page 2023

[39] [39]

The complexity of finding stationary points with stochastic gradient descent.arXiv preprint arXiv:1910.01845, 2019

Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent.arXiv preprint arXiv:1910.01845, 2019

work page arXiv 1910

[40] [40]

Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

work page 2019

[41] [41]

Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

work page 2011

[42] [42]

SIAM, 1999

Ivar Ekeland and Roger Temam.Convex analysis and variational problems. SIAM, 1999

work page 1999

[43] [43]

Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator.Advances in neural information processing systems, 31, 2018

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator.Advances in neural information processing systems, 31, 2018

work page 2018

[44] [44]

Sharp analysis for nonconvex SGD escaping from saddle points

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex SGD escaping from saddle points. In Conference on Learning Theory, pages 1192–1234. PMLR, 2019

work page 2019

[45] [45]

Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance

Arda Fazla, Ege C. Kaya, Antesh Upadhyay, and Abolfazl Hashemi. Lower bounds and proximally anchored SGD for non-convex minimization under unbounded variance.arXiv preprint arXiv:2604.16620, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [46]

Stochastic first-and zeroth-order methods for nonconvex stochastic program- ming.SIAM journal on optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic program- ming.SIAM journal on optimization, 23(4):2341–2368, 2013

work page 2013

[47] [47]

Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

work page 2016

[48] [48]

On stochastic approximation.Theory of Probability & Its Applications, 10(2):275–278, 1965

EG Gladyshev. On stochastic approximation.Theory of Probability & Its Applications, 10(2):275–278, 1965

work page 1965

[49] [49]

High- probability complexity bounds for non-smooth stochastic convex optimization with heavy-tailed noise.Journal of Optimization Theory and Applications, 203(3):2679–2738, 2024

Eduard Gorbunov, Marina Danilova, Innokentiy Shibaev, Pavel Dvurechensky, and Alexander Gasnikov. High- probability complexity bounds for non-smooth stochastic convex optimization with heavy-tailed noise.Journal of Optimization Theory and Applications, 203(3):2679–2738, 2024

work page 2024

[50] [50]

A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. InInternational Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020

work page 2020

[51] [51]

High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise.arXiv preprint arXiv:2310.01860, 2023

Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise.arXiv preprint arXiv:2310.01860, 2023

work page arXiv 2023

[52] [52]

Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

Benjamin Grimmer. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

work page 2019

[53] [53]

Disa: A dual inexact splitting algorithm for distributed convex composite optimization.IEEE Transactions on Automatic Control, 69(5):2995–3010, 2023

Luyao Guo, Xinli Shi, Shaofu Yang, and Jinde Cao. Disa: A dual inexact splitting algorithm for distributed convex composite optimization.IEEE Transactions on Automatic Control, 69(5):2995–3010, 2023

work page 2023

[54] [54]

The heavy-tail phenomenon in SGD

Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. InInternational Conference on Machine Learning, pages 3964–3975. PMLR, 2021

work page 2021

[55] [55]

Comparison of optimization techniques based on gradient descent algorithm: A review.PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(4):2715–2743, 2021

Saad Hikmat Haji and Adnan Mohsin Abdulazeez. Comparison of optimization techniques based on gradient descent algorithm: A review.PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(4):2715–2743, 2021

work page 2021

[56] [56]

A primal-dual algorithm with line search for general convex-concave saddle point problems.SIAM Journal on Optimization, 31(2):1299–1329, 2021

Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm with line search for general convex-concave saddle point problems.SIAM Journal on Optimization, 31(2):1299–1329, 2021

work page 2021

[57] [57]

J. M. Hammersley and J. G. Mauldon. General principles of antithetic variates.Mathematical Proceedings of the Cambridge Philosophical Society, 52(3):476–481, July 1956

work page 1956

[58] [58]

J. M. Hammersley and K. W. Morton. A new Monte Carlo technique: Antithetic variates.Mathematical Proceedings of the Cambridge Philosophical Society, 52(3):449–475, July 1956. 12

work page 1956

[59] [59]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InConference on Learning Theory, pages 1579–1613. PMLR, 2019

work page 2019

[60] [60]

Hashemi, D

A. Hashemi, D. Lee, and A. Makur. Strong antithetic variates: Theory and applications. Preprint, pp. 1–13, 2025

work page 2025

[61] [61]

A unified model for large-scale inexact fixed-point iteration: A stochastic optimization perspective.IEEE Transactions on Automatic Control, 70(4):2435–2449, 2024

Abolfazl Hashemi. A unified model for large-scale inexact fixed-point iteration: A stochastic optimization perspective.IEEE Transactions on Automatic Control, 70(4):2435–2449, 2024

work page 2024

[62] [62]

RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation

Abolfazl Hashemi. Rampage: Randomized mid-point for debiased gradient extrapolation.arXiv preprint arXiv:2603.22155, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[63] [63]

Strong antithetic variance reduction inequalities

Abolfazl Hashemi, Dogmin Lee, and Anuran Makur. Strong antithetic variance reduction inequalities. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pages 1–6, Ann Arbor, MI, USA, June 22–27 2025

work page 2025

[64] [64]

Resilient constrained learning.Advances in Neural Information Processing Systems, 36:71767–71798, 2023

Ignacio Hounie, Alejandro Ribeiro, and Luiz FO Chamon. Resilient constrained learning.Advances in Neural Information Processing Systems, 36:71767–71798, 2023

work page 2023

[65] [65]

Yankun Huang and Qihang Lin. Oracle complexity of single-loop switching subgradient methods for non- smooth weakly convex functional constrained optimization.Advances in Neural Information Processing Systems, 36:61327–61340, 2023

work page 2023

[66] [66]

Distributed proximal algorithms for nonsmooth optimiza- tion: Unified convergence analysis.IEEE Transactions on Automatic Control, 2025

Yi Huang, Shisheng Cui, Jian Sun, and Ziyang Meng. Distributed proximal algorithms for nonsmooth optimiza- tion: Unified convergence analysis.IEEE Transactions on Automatic Control, 2025

work page 2025

[67] [67]

Distributed multiproximal algorithm for nonsmooth convex optimization with coupled inequality constraints.IEEE Transactions on Automatic Control, 68(12):8126–8133, 2023

Yi Huang, Ziyang Meng, Jian Sun, and Wei Ren. Distributed multiproximal algorithm for nonsmooth convex optimization with coupled inequality constraints.IEEE Transactions on Automatic Control, 68(12):8126–8133, 2023

work page 2023

[68] [68]

Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, and Tianjiao Li. Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

work page arXiv 2023

[69] [69]

Safe-ef: Error feedback for non-smooth constrained optimiza- tion

Rustem Islamov, Yarden As, and Ilyas Fatkhullin. Safe-ef: Error feedback for non-smooth constrained optimiza- tion. InForty-second International Conference on Machine Learning, 2025

work page 2025

[70] [70]

Safe-EF: Error feedback for non-smooth constrained optimiza- tion

Rustem Islamov, Yarden As, and Ilyas Fatkhullin. Safe-EF: Error feedback for non-smooth constrained optimiza- tion. InForty-second International Conference on Machine Learning, 2025

work page 2025

[71] [71]

Unconstrained online learning with unbounded losses

Andrew Jacobsen and Ashok Cutkosky. Unconstrained online learning with unbounded losses. InInternational Conference on Machine Learning, pages 14590–14630. PMLR, 2023

work page 2023

[72] [72]

Jadbabaie, A

A. Jadbabaie, A. Makur, and A. Reisizadeh. Adaptive low-rank gradient descent. InProceedings of the 62nd IEEE Conference on Decision and Control (CDC), pages 3315–3320, Singapore, December 13–15 2023

work page 2023

[73] [73]

Jadbabaie, A

A. Jadbabaie, A. Makur, and D. Shah. Federated optimization of smooth loss functions.IEEE Transactions on Information Theory, 69(12):7836–7866, December 2023

work page 2023

[74] [74]

Jadbabaie, A

A. Jadbabaie, A. Makur, and D. Shah. Gradient-based empirical risk minimization using local polynomial regression.Stochastic Systems, 14(4):363–402, December 2024

work page 2024

[75] [75]

Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification.Journal of machine learning research, 18(223):1–42, 2018

work page 2018

[76] [76]

First-order methods for nonsmooth nonconvex functional constrained optimization with or without slater points.SIAM Journal on Optimization, 35(2):1300–1329, 2025

Zhichao Jia and Benjamin Grimmer. First-order methods for nonsmooth nonconvex functional constrained optimization with or without slater points.SIAM Journal on Optimization, 35(2):1300–1329, 2025

work page 2025

[77] [77]

Accelerating stochastic gradient descent using predictive variance reduction

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013. 13

work page 2013

[78] [78]

Large deviations of vector-valued martingales in 2-smooth normed spaces.arXiv preprint arXiv:0809.0813, 2008

Anatoli Juditsky and Arkadii S Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces.arXiv preprint arXiv:0809.0813, 2008

work page arXiv 2008

[79] [79]

Kahn and A

H. Kahn and A. W. Marshall. Methods of reducing sample size in Monte Carlo computations.Journal of the Operations Research Society of America, 1(5):263–278, November 1953

work page 1953

[80] [80]

Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

work page 2023