Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

Zijian Liu

arxiv: 2512.23178 · v3 · pith:ZDPKG2NDnew · submitted 2025-12-29 · 🧮 math.OC · cs.LG· stat.ML

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

Zijian Liu This is my paper

Pith reviewed 2026-05-21 17:10 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML

keywords clipped SGDheavy-tailed noisenonsmooth convex optimizationgeneralized effective dimensionconvergence rateshigh-probability boundsin-expectation convergenceFreedman's inequality

0 comments

The pith

Clipped SGD achieves faster convergence rates for nonsmooth convex optimization under heavy-tailed noise using a generalized effective dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper provides a refined analysis of clipped stochastic gradient descent when gradient noise has only a bounded p-th moment for p between 1 and 2. It derives improved high-probability convergence rates that incorporate a generalized effective dimension to tighten the bounds compared to previous results. The analysis also extends to convergence in expectation, producing rates that exceed previously known lower bounds, and establishes matching lower bounds to show optimality in the expectation case. This matters because heavy-tailed noise is common in modern machine learning, and better rates mean more efficient optimization in practice.

Core claim

By improving the utilization of Freedman's inequality and deriving finer bounds for the clipping error under heavy-tailed noise, Clipped SGD guarantees faster high-probability rates of O(σ_l d_eff^{-1/(2p)} ln^{1-1/p}(1/δ) T^{1/p-1}) for nonsmooth convex problems and O(σ_l^2 d_eff^{-1/p} ln^{2-2/p}(1/δ) T^{2/p-2}) for strongly convex ones, where d_eff is the generalized effective dimension. The refined analysis further yields new rates for convergence in expectation that break known lower bounds, and new lower bounds are provided that match these upper bounds, establishing optimality for in-expectation convergence.

What carries the argument

The generalized effective dimension d_eff, which allows for tighter bounds by capturing a dimension-like quantity in the analysis of clipping errors and concentration inequalities.

If this is right

High-probability rates improve by polynomial factors in the generalized effective dimension.
Convergence in expectation is optimal because the new upper bounds match the established lower bounds.
The refined clipping error bounds and Freedman's inequality application apply to both nonsmooth convex and strongly convex settings.
Gradient clipping handles heavy-tailed noise more efficiently than prior analyses suggested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In problems where the effective dimension stays much smaller than ambient dimension, the practical speed-up could be large even in high-dimensional settings.
Similar refinements might apply to other first-order methods that use clipping or truncation under heavy tails.
The matching lower bounds imply that further rate improvements would need relaxed assumptions or entirely different algorithmic structures.

Load-bearing premise

The gradient noise has a bounded p-th moment upper bounded by σ_l^p for some p in (1,2].

What would settle it

An experiment or calculation where the observed convergence rate of clipped SGD does not improve when the generalized effective dimension increases or fails to match the new in-expectation upper bounds.

read the original abstract

Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded ${\frak p}$-th moment where ${\frak p}\in(1,2]$ has been recognized to be more realistic (say being upper bounded by $\sigma_{\frak l}^{\frak p}$ for some $\sigma_{\frak l}\ge0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate ${\cal O}(\sigma_{\frak l}\ln(1/\delta)T^{1/{\frak p}-1})$ (resp. ${\cal O}(\sigma_{\frak l}^2\ln^2(1/\delta)T^{2/{\frak p}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta\in(0,1]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two rates, ${\cal O}(\sigma_{\frak l}d_{\rm eff}^{-1/2{\frak p}}\ln^{1-1/{\frak p}}(1/\delta)T^{1/{\frak p}-1})$ and ${\cal O}(\sigma_{\frak l}^2d_{\rm eff}^{-1/{\frak p}}\ln^{2-2/{\frak p}}(1/\delta)T^{2/{\frak p}-2})$, faster than the aforementioned best results, where $d_{\rm eff}\ge1$ is a quantity we call the $\textit{generalized effective dimension}$. Our analysis improves upon the existing approach on two sides: better utilization of Freedman's inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper tightens clipped SGD rates under heavy-tailed noise via a generalized effective dimension and matches lower bounds for expectation convergence, but the high-probability speedup may reduce to logs if d_eff stays near 1.

read the letter

The main thing here is that this work refines the high-probability rates for clipped SGD on nonsmooth convex problems when gradients have only a p-th moment bound, and it gives matching lower bounds that make the in-expectation rates optimal. They achieve this by sharpening the application of Freedman's inequality and tightening the clipping error terms, which lets them insert a generalized effective dimension d_eff into the bounds and improve the log exponents at the same time. The expectation optimality result stands out as clean and useful on its own terms. That part of the contribution looks solid and worth having in the literature. The soft spot is the one the stress test points to. The abstract states d_eff is at least 1 but does not show whether it is meaningfully larger than 1 for the standard heavy-tailed distributions that motivated the problem. If d_eff collapses to 1 under the usual assumptions, the polynomial improvement disappears and the gain is mainly the milder logs. I would check the full definition and any concrete examples in the proofs before accepting the headline claim of faster rates. This paper is for people who work on stochastic optimization with heavy tails and high-probability guarantees. Readers who follow the clipped SGD line of work will get value from the refined bounds and the matching lower bounds. It has enough new technical content and a clear optimality statement to deserve a serious referee, though the high-probability claims need careful verification on the d_eff term. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes Clipped SGD for nonsmooth convex optimization under heavy-tailed noise with p-th moment bounds (p∈(1,2]). It refines the analysis using improved Freedman's inequality and clipping error bounds to introduce a generalized effective dimension d_eff ≥1, yielding high-probability rates O(σ_l d_eff^{-1/(2p)} ln^{1-1/p}(1/δ) T^{1/p-1}) for convex problems and O(σ_l^2 d_eff^{-1/p} ln^{2-2/p}(1/δ) T^{2/p-2}) for strongly convex ones. The paper also provides in-expectation rates matching new lower bounds, suggesting optimality.

Significance. Should the refined bounds hold, particularly with d_eff providing a non-trivial factor, the results would improve upon existing high-probability guarantees for clipped methods in heavy-tailed settings. The optimality result for convergence in expectation is a notable contribution. The approach builds on standard tools but refines their application, which could influence future analyses in stochastic optimization.

major comments (2)

[§3.2] §3.2, Definition of d_eff: The definition of the generalized effective dimension d_eff is load-bearing for the claimed improvement. The manuscript must clarify whether d_eff is strictly greater than 1 under standard assumptions on the noise (e.g., coordinate-independent p-moment bounded noise). If d_eff=1, the rates reduce to O(σ_l ln^{1-1/p}(1/δ) T^{1/p-1}), which is only a logarithmic improvement over the prior O(σ_l ln(1/δ) T^{1/p-1}), undermining the polynomial speedup narrative.
[Theorem 4.1] Theorem 4.1: The high-probability bound incorporates d_eff^{-1/(2p)}; without an explicit lower bound or example computation showing d_eff >1 for typical heavy-tailed distributions (such as isotropic or coordinate-wise), the practical gain over prior results remains unverified.

minor comments (2)

[Notation] The notation frak p for the moment order is nonstandard; switching to plain p would improve readability and consistency with related literature.
[Section 1] Add a brief comparison table of the new rates versus prior bounds (including the exact dependence on d_eff) to highlight the refinement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. The comments highlight important points about the role and interpretation of the generalized effective dimension d_eff. We address each major comment below and have made revisions to clarify these aspects without altering the core technical contributions.

read point-by-point responses

Referee: [§3.2] §3.2, Definition of d_eff: The definition of the generalized effective dimension d_eff is load-bearing for the claimed improvement. The manuscript must clarify whether d_eff is strictly greater than 1 under standard assumptions on the noise (e.g., coordinate-independent p-moment bounded noise). If d_eff=1, the rates reduce to O(σ_l ln^{1-1/p}(1/δ) T^{1/p-1}), which is only a logarithmic improvement over the prior O(σ_l ln(1/δ) T^{1/p-1}), undermining the polynomial speedup narrative.

Authors: We agree that explicit clarification is needed. In the revised manuscript, Section 3.2 now includes a new paragraph and Remark 3.3 stating that under fully coordinate-independent noise with identical p-moment bounds, d_eff equals 1 and the improvement is logarithmic. However, the definition of d_eff is designed to capture anisotropy or coordinate dependence in the heavy-tailed noise, which is a standard and realistic setting in many applications. For such cases d_eff > 1, yielding the stated polynomial factor. We have added a concrete two-dimensional example with correlated noise components where d_eff = 1.5, confirming the distinction. revision: yes
Referee: [Theorem 4.1] Theorem 4.1: The high-probability bound incorporates d_eff^{-1/(2p)}; without an explicit lower bound or example computation showing d_eff >1 for typical heavy-tailed distributions (such as isotropic or coordinate-wise), the practical gain over prior results remains unverified.

Authors: We thank the referee for this suggestion. The revised version adds Remark 4.2 immediately after Theorem 4.1 together with a short computation in Appendix C. For a coordinate-wise heavy-tailed distribution with heterogeneous p-moment bounds (a common practical case), we explicitly compute d_eff = (sum σ_i^{2p} / (sum σ_i^p)^2)^{1/p} which is strictly greater than 1 whenever the σ_i are not all equal. This provides a verifiable polynomial gain for non-isotropic noise while acknowledging that the gain reduces to logarithmic for the fully isotropic coordinate-independent case. revision: yes

Circularity Check

0 steps flagged

No significant circularity; refined rates derive from improved application of Freedman's inequality and clipping bounds

full rationale

The paper's derivation improves upon prior clipped SGD analyses by refining clipping error bounds under p-moment noise and applying a tighter version of Freedman's inequality, yielding rates that incorporate a new generalized effective dimension d_eff >=1. These steps rely on standard external tools (Freedman's inequality) and do not reduce the claimed upper bounds to self-defined quantities or self-citation chains. The in-expectation lower bounds are separately established to match the new upper bounds, confirming independence rather than tautology. No load-bearing step equates a prediction to a fitted input or renames a known result via ansatz. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on standard domain assumptions for convex optimization and heavy-tailed noise models, plus the introduction of the generalized effective dimension as a new quantity to refine the bounds.

axioms (2)

domain assumption The objective function is convex (nonsmooth).
Core modeling assumption for the problem class in the abstract.
domain assumption Gradient noise has bounded p-th moment for p in (1,2] upper bounded by σ_l^p.
Key assumption enabling the clipping error analysis and concentration bounds.

invented entities (1)

generalized effective dimension d_eff no independent evidence
purpose: To refine convergence rates by incorporating a dimension-like factor into the bounds.
New quantity defined and used in the paper to achieve the improved rates.

pith-pipeline@v0.9.0 · 5966 in / 1546 out tokens · 209488 ms · 2026-05-21T17:10:29.626611+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

d_eff ≜ σ_l² / σ_s² ∈ {0} ∪ [1, π d/2] = O(d) (Eq. 1); recovers effective dimension Tr(Σ)/‖Σ‖ when p=2
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lemma 1: finer bounds E[‖d_u_t‖²] ≤ O(σ_l^p τ^{2-p}), ‖E[d_u_t d_u_t^⊤]‖ ≤ O(σ_s^p τ^{2-p} + σ_l^p G² τ^{-p})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

Freedman, Another note on the borel-cantelli lemma and the strong law, with the poisson approxi- mation as a by-product, Annals of Probability 1 (6) (1973) 910–925

URLhttps://proceedings.mlr.press/v125/davis20a.html. John Duchi, Michael I Jordan, and Brendan McMahan. Estimation, optimization, and paral- lelism when data is sparse. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors,Advances in Neural Information Processing Systems, volume 26. Cur- ran Associates, Inc., 2013. URLhttps:/...

work page doi:10.1214/aop/1176996452 2013
[2]

URLhttps://proceedings.mlr.press/v139/garg21b.html

PMLR, 18–24 Jul 2021. URLhttps://proceedings.mlr.press/v139/garg21b.html. E. N. Gilbert. A comparison of signalling alphabets.The Bell System Technical Journal, 31(3):504–522,

work page 2021
[3]

Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov

doi: 10.1002/j.1538-7305.1952.tb01393.x. Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 15042–15053. Cur- ran Associates...

work page doi:10.1002/j.1538-7305.1952.tb01393.x 1952
[4]

Matthew J

URLhttps://proceedings.mlr.press/v139/hodgkinson21a.html. Matthew J. Holland. Anytime guarantees under heavy-tailed data.Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6918–6925, Jun. 2022. doi: 10.1609/aaai.v36i6.20649. URLhttps://ojs. aaai.org/index.php/AAAI/article/view/20649. Florian Hübler, Ilyas Fatkhullin, and Niao He. From gr...

work page doi:10.1609/aaai.v36i6.20649 2022
[5]

guarantees that with probability at least1−δ,F( ¯xcvx T+1 )−F ⋆ converges at the rate of O   φ+ ln 3 δ GD T + σ p 2 l G1− p 2 +G D √ T + σ 2 p −1 s σ 2− 2 p l +σ 1 p s σ 1− 1 p l ln1− 1 p 3 δ D T 1− 1 p   . Remark10.There are two points we want to emphasize: First, the choiceα= 1/2is not essential and can be changed to anyα∈(0,1), only resulting i...

work page 2025
[6]

guarantees thatE F( ¯xcvx T+1 )−F ⋆ converges at the rate of O  eφGD T + σ p 2 l G1− p 2 +G D √ T + σ 2 p −1 s σ 2− 2 p l D T 1− 1 p   . Proof.By Lemmas 5 and 6, we can follow a similar argument until (44) in the proof of Theorem 10 to have E F( ¯xcvx T+1 )−F ⋆ ≤ D2 ηT+1 T + 2Bcvx T T , where, underη t =η, τ t = max n G 1−α , τ T 1 p o ,∀t∈[T]forη, τ ...

work page 2023
[7]

Proof.First, the choice ofη t = 6 µt ,∀t∈[T]satisfiesη t ≤ η µ ,∀t∈[T]forη= 6, fulfilling the requirement of Lemma 7

guarantees that with probability at least1−δ, bothF(¯xstr T+1 )−F ⋆ andµ∥x T+1 −x ⋆∥2 converge at the rate of O   µD2 T 3 + φ2 + ln2 3 δ G2 µT 2 + σp l +σ p s ln 3 δ G2−p +G 2 µT + σ 4 p −2 s σ 4− 4 p l +σ 2 p s σ 2− 2 p l ln2− 2 p 3 δ µT 2− 2 p   , whereφ≤φ ⋆ is a constant defined in (46) and equalsφ⋆ whenT= Ω Gp σp l φ⋆ . Proof.First, the choice ofη...

work page 2020
[8]

Next, we give two one-step descent inequalities for our algorithms

to set upτt now, which we believe could be useful for future work. Next, we give two one-step descent inequalities for our algorithms. The analysis is standard in the literature, which we reproduce here for completeness. Lemma 3.Under Assumptions 2 and 3, for anyy∈Xandt∈N: •Clipped SGD (Algorithm 1) guarantees F(x t+1)−F(y)≤ ∥y−x t∥2 2ηt − (1 +µη t)∥y−x t...

work page 2023
[9]

for anyδ∈(0,1], with probability at least1−δ,I cvx T ≤A cvx T whereI cvx T is defined in Lemma 4 and Acvx T is a constant in the order of O  max t∈[T] ηtτ 2 t ln2 3 δ + TX t=1 σp l ηt τ p−2 t + TX t=1 σsσp−1 l √ηt τ p−1 t + σp l G√ηt αp−1τ p t !2 + TX t=1 G2ηt   . 40 2.J cvx T ≤B cvx T whereJ cvx T is defined in Lemma 5 andBcvx T is a constant in the ...

work page
[10]

for anyδ∈(0,1], with probability at least1−δ,I str T ≤A str T whereI str T is defined in Lemma 7 andAstr T is a constant in the order of O max t∈[T] Γtη2 t τ 2 t ln2 3 δ + TX t=1 σp l Γtη2 t τ p−2 t + TX t=1 σp sΓtη2 t τ p−2 t + σp l G2Γtη2 t αp−1τ p t ln 3 δ + TX t=1 σ2 s σ2p−2 l Γtηt τ 2p−2 t + σ2p l G2Γtηt α2p−2τ 2p t ! 2η+ 1 µ + TX t=1 G2Γtη2 t ! . 2....

work page 1975
[11]

At the beginning, the algorithm choosesx1 =A 0(r(·))

work page
[12]

At thet-th iteration fort∈[T], the algorithm queries and observes the stochastic gradientgt =g(x t, f) and setsx t+1 =A t (r(·),g 1,· · ·,g t). Minimax lower bound.Under the above protocol, our goal is to lower bound the following two quantities fortype∈ {cvx,str}andδ∈(0,1](where we recallF ⋆ = inf x∈Rd F(x)), Rtype ⋆ ≜sup g∈Gp σs ,σl inf A0:T ∈AT sup F∈F...

work page 2013

[1] [1]

Freedman, Another note on the borel-cantelli lemma and the strong law, with the poisson approxi- mation as a by-product, Annals of Probability 1 (6) (1973) 910–925

URLhttps://proceedings.mlr.press/v125/davis20a.html. John Duchi, Michael I Jordan, and Brendan McMahan. Estimation, optimization, and paral- lelism when data is sparse. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors,Advances in Neural Information Processing Systems, volume 26. Cur- ran Associates, Inc., 2013. URLhttps:/...

work page doi:10.1214/aop/1176996452 2013

[2] [2]

URLhttps://proceedings.mlr.press/v139/garg21b.html

PMLR, 18–24 Jul 2021. URLhttps://proceedings.mlr.press/v139/garg21b.html. E. N. Gilbert. A comparison of signalling alphabets.The Bell System Technical Journal, 31(3):504–522,

work page 2021

[3] [3]

Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov

doi: 10.1002/j.1538-7305.1952.tb01393.x. Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 15042–15053. Cur- ran Associates...

work page doi:10.1002/j.1538-7305.1952.tb01393.x 1952

[4] [4]

Matthew J

URLhttps://proceedings.mlr.press/v139/hodgkinson21a.html. Matthew J. Holland. Anytime guarantees under heavy-tailed data.Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6918–6925, Jun. 2022. doi: 10.1609/aaai.v36i6.20649. URLhttps://ojs. aaai.org/index.php/AAAI/article/view/20649. Florian Hübler, Ilyas Fatkhullin, and Niao He. From gr...

work page doi:10.1609/aaai.v36i6.20649 2022

[5] [5]

guarantees that with probability at least1−δ,F( ¯xcvx T+1 )−F ⋆ converges at the rate of O   φ+ ln 3 δ GD T + σ p 2 l G1− p 2 +G D √ T + σ 2 p −1 s σ 2− 2 p l +σ 1 p s σ 1− 1 p l ln1− 1 p 3 δ D T 1− 1 p   . Remark10.There are two points we want to emphasize: First, the choiceα= 1/2is not essential and can be changed to anyα∈(0,1), only resulting i...

work page 2025

[6] [6]

guarantees thatE F( ¯xcvx T+1 )−F ⋆ converges at the rate of O  eφGD T + σ p 2 l G1− p 2 +G D √ T + σ 2 p −1 s σ 2− 2 p l D T 1− 1 p   . Proof.By Lemmas 5 and 6, we can follow a similar argument until (44) in the proof of Theorem 10 to have E F( ¯xcvx T+1 )−F ⋆ ≤ D2 ηT+1 T + 2Bcvx T T , where, underη t =η, τ t = max n G 1−α , τ T 1 p o ,∀t∈[T]forη, τ ...

work page 2023

[7] [7]

Proof.First, the choice ofη t = 6 µt ,∀t∈[T]satisfiesη t ≤ η µ ,∀t∈[T]forη= 6, fulfilling the requirement of Lemma 7

guarantees that with probability at least1−δ, bothF(¯xstr T+1 )−F ⋆ andµ∥x T+1 −x ⋆∥2 converge at the rate of O   µD2 T 3 + φ2 + ln2 3 δ G2 µT 2 + σp l +σ p s ln 3 δ G2−p +G 2 µT + σ 4 p −2 s σ 4− 4 p l +σ 2 p s σ 2− 2 p l ln2− 2 p 3 δ µT 2− 2 p   , whereφ≤φ ⋆ is a constant defined in (46) and equalsφ⋆ whenT= Ω Gp σp l φ⋆ . Proof.First, the choice ofη...

work page 2020

[8] [8]

Next, we give two one-step descent inequalities for our algorithms

to set upτt now, which we believe could be useful for future work. Next, we give two one-step descent inequalities for our algorithms. The analysis is standard in the literature, which we reproduce here for completeness. Lemma 3.Under Assumptions 2 and 3, for anyy∈Xandt∈N: •Clipped SGD (Algorithm 1) guarantees F(x t+1)−F(y)≤ ∥y−x t∥2 2ηt − (1 +µη t)∥y−x t...

work page 2023

[9] [9]

for anyδ∈(0,1], with probability at least1−δ,I cvx T ≤A cvx T whereI cvx T is defined in Lemma 4 and Acvx T is a constant in the order of O  max t∈[T] ηtτ 2 t ln2 3 δ + TX t=1 σp l ηt τ p−2 t + TX t=1 σsσp−1 l √ηt τ p−1 t + σp l G√ηt αp−1τ p t !2 + TX t=1 G2ηt   . 40 2.J cvx T ≤B cvx T whereJ cvx T is defined in Lemma 5 andBcvx T is a constant in the ...

work page

[10] [10]

for anyδ∈(0,1], with probability at least1−δ,I str T ≤A str T whereI str T is defined in Lemma 7 andAstr T is a constant in the order of O max t∈[T] Γtη2 t τ 2 t ln2 3 δ + TX t=1 σp l Γtη2 t τ p−2 t + TX t=1 σp sΓtη2 t τ p−2 t + σp l G2Γtη2 t αp−1τ p t ln 3 δ + TX t=1 σ2 s σ2p−2 l Γtηt τ 2p−2 t + σ2p l G2Γtηt α2p−2τ 2p t ! 2η+ 1 µ + TX t=1 G2Γtη2 t ! . 2....

work page 1975

[11] [11]

At the beginning, the algorithm choosesx1 =A 0(r(·))

work page

[12] [12]

At thet-th iteration fort∈[T], the algorithm queries and observes the stochastic gradientgt =g(x t, f) and setsx t+1 =A t (r(·),g 1,· · ·,g t). Minimax lower bound.Under the above protocol, our goal is to lower bound the following two quantities fortype∈ {cvx,str}andδ∈(0,1](where we recallF ⋆ = inf x∈Rd F(x)), Rtype ⋆ ≜sup g∈Gp σs ,σl inf A0:T ∈AT sup F∈F...

work page 2013