Stochastic Non-Smooth Convex Optimization with Unbounded Gradients

Dmitry Kovalev

arxiv: 2605.15522 · v1 · pith:J5AGE4GInew · submitted 2026-05-15 · 🧮 math.OC · cs.LG

Stochastic Non-Smooth Convex Optimization with Unbounded Gradients

Dmitry Kovalev This is my paper

Pith reviewed 2026-05-19 15:11 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords stochastic optimizationconvex optimizationAdamWgradient clippinggeneralized Lipschitzunbounded gradientsconvergence ratesquasar-convex

0 comments

The pith

AdamW with clipped updates outperforms SGD and AdaGrad on convex stochastic problems with unbounded gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generalized Lipschitz class of functions in which gradient norms grow at most linearly with the current optimality gap, replacing the common but restrictive assumption of uniformly bounded gradients. It then compares several first-order stochastic methods under this assumption and establishes faster global convergence for AdamW when its updates are clipped. The analysis also isolates the benefit of AdamW's exponentially weighted gradient accumulation over plain averaging, and shows that the same clipped variant remains competitive under generalized smoothness and quasar-convexity.

Core claim

For convex stochastic generalized Lipschitz optimization problems, AdamW with clipped updates achieves the best global convergence rates among popular stochastic optimization methods such as SGD and AdaGrad; the exponentially weighted gradient accumulation of AdamW is essential to these rates, and the same clipped procedure also yields improved rates under generalized smoothness while extending to quasar-convex and preconditioned settings.

What carries the argument

The generalized Lipschitz condition that bounds gradient norm by an affine function of the optimality gap, together with the clipped AdamW update that combines exponential moving averages and gradient clipping.

If this is right

Clipped AdamW remains competitive and achieves improved rates when the objective also satisfies the popular generalized smoothness assumption.
The exponentially weighted accumulation inside AdamW, rather than simple averaging, is necessary for the superior rates.
The same clipped AdamW analysis carries over to versions that use diagonal or matrix preconditioners.
The convergence guarantees extend from convex to quasar-convex objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework suggests testing whether clipping plus momentum improves empirical performance on non-smooth deep-learning tasks whose gradients grow with parameter distance from a good solution.
Similar rate comparisons could be carried out for other adaptive methods such as RMSProp or Lion under the same generalized Lipschitz assumption.
The affine bound on gradients might be relaxed further to sublinear or logarithmic growth while preserving the advantage of clipped AdamW.

Load-bearing premise

The objective function satisfies that its gradient norm is bounded by a linear function of the current distance to optimality.

What would settle it

A convex function belonging to the generalized Lipschitz class on which SGD with tuned stepsizes converges at a strictly better rate than clipped AdamW.

read the original abstract

Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates, theoretically outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper relaxes uniform gradient bounds to a gap-dependent class and claims clipped AdamW beats SGD and AdaGrad, but the noise scaling needs explicit checking.

read the letter

The paper drops the usual uniform bound on gradients and replaces it with a generalized Lipschitz condition: the gradient norm is at most an affine function of the current optimality gap. This matches the behavior in many machine-learning problems where gradients start large and shrink as training progresses. Under this assumption they compare several clipped stochastic methods on convex problems and conclude that AdamW with clipping achieves the best rates, with the exponential moving average playing a key role over plain averaging. They also carry the same style of analysis to generalized smoothness and quasar-convex objectives. That extension is useful because it shows the result is not tied to one narrow setting. The derivations follow standard stochastic approximation steps but applied to the wider function class, and the explicit rate comparisons give a clear ordering among the algorithms. If the proofs are tight, this supplies a concrete reason to prefer the clipped adaptive method when gradients are not uniformly bounded. The soft spot is the handling of stochastic noise. The new class controls the true gradient, yet it does not automatically bound the variance of the stochastic estimator. If that variance grows with the gradient size, the accumulated noise in AdamW’s moving average could erase part of the claimed advantage over clipped SGD. The paper would be stronger with an explicit noise assumption or a short argument showing why the rates survive when variance scales with the gap. This work is aimed at researchers who write or read theory papers on stochastic first-order methods. Anyone comparing adaptive and non-adaptive optimizers under realistic unbounded-gradient conditions will find the comparisons worth reading. The modeling change is practical and the claims are specific enough to be checked, so the paper deserves a serious referee. I would send it out for review and ask the referees to verify the noise terms in the proofs.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the class of generalized Lipschitz functions, in which the gradient norm is bounded by an affine function of the optimality gap. It develops convergence analyses for several stochastic first-order methods under this assumption for convex problems and claims that AdamW with clipped updates achieves strictly superior global rates to SGD and AdaGrad. The analysis emphasizes the role of exponentially weighted gradient accumulation rather than simple averaging. Extensions are given to the generalized smoothness setting, diagonal and matrix preconditioners, and the quasar-convex case.

Significance. If the central rates hold under the stated assumptions, the work would meaningfully relax the uniform gradient bound that is common but often unrealistic in non-smooth stochastic optimization. The explicit comparison of clipped AdamW against SGD and AdaGrad, together with the identification of exponential averaging as critical, could influence both theory and practical algorithm choice. No machine-checked proofs or reproducible code are mentioned, but the rates are in principle falsifiable.

major comments (2)

[Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.
[Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.

minor comments (2)

[Abstract] The abstract states that AdamW 'theoretically outperforms' other methods; the precise rates (e.g., dependence on T, constants, and noise parameters) should be stated explicitly so readers can verify the improvement.
[Definition of generalized Lipschitz class] Clarify whether the generalized Lipschitz condition is assumed to hold with the same constants a, b for all iterates or only in expectation; this affects how the analysis handles early iterates far from the optimum.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and the opportunity to clarify our contributions. We address each major comment below and have revised the manuscript to strengthen the presentation of assumptions and comparisons.

read point-by-point responses

Referee: [Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.

Authors: We thank the referee for highlighting this important distinction. The generalized Lipschitz condition bounds only the true gradient; our analysis of clipped AdamW (and the comparison to SGD) additionally invokes a standard uniform bound on the variance of the stochastic gradients that is independent of the optimality gap. This assumption is present in the original submission but was not sufficiently foregrounded. In the revision we have added an explicit statement (Assumption 2.3) and a new remark (Remark 3.2) that isolates the role of bounded variance. We also include a short counter-example showing that if variance were allowed to grow quadratically with the gradient norm, the exponential-averaging advantage of AdamW would indeed disappear and the rates would collapse to those of clipped SGD. Thus the bounded-variance assumption is necessary for the claimed strict improvement. revision: yes
Referee: [Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.

Authors: We agree that identical noise assumptions are required for the comparison to be meaningful. All three algorithms were originally analyzed under the same pair of assumptions (generalized Lipschitz plus uniformly bounded stochastic-gradient variance). To make the parallelism transparent, we have re-derived the SGD and AdaGrad rates in a single unified theorem (Theorem 3.1) that uses exactly the same noise model and proof template as the AdamW result (Theorem 3.4). The rates are now displayed side-by-side in Table 1, which also isolates the improvement attributable to exponential averaging. This revision removes any ambiguity about differing noise models. revision: yes

Circularity Check

0 steps flagged

No circularity: new function class analyzed with standard techniques

full rationale

The paper defines a new generalized Lipschitz class (gradient norm bounded affinely by optimality gap) and performs fresh convergence analysis on existing algorithms including clipped AdamW. No step reduces a claimed rate or outperformance result to a fitted parameter, self-referential definition, or load-bearing self-citation; the central comparison follows from applying standard stochastic analysis tools to the new assumption class. The derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Central claims rest on the newly defined generalized Lipschitz property plus standard background assumptions of convexity and unbiased stochastic gradients.

axioms (2)

domain assumption The objective function is convex.
Standard assumption invoked for the optimization setting.
domain assumption Stochastic gradients are unbiased estimates of the true gradient.
Common modeling assumption in stochastic first-order methods.

invented entities (1)

generalized Lipschitz functions no independent evidence
purpose: Model non-smooth convex functions whose gradient norms grow affinely with optimality gap
Newly introduced class that replaces uniform bounded-gradient assumption.

pith-pipeline@v0.9.0 · 5683 in / 1230 out tokens · 75511 ms · 2026-05-19T15:11:36.275720+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 5 internal anchors

[1]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

arXiv preprint arXiv:1905.11881 , year=

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881,

work page arXiv 1905
[4]

Near-optimal methods for minimizing star-convex functions and beyond

Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. In Conference on learning theory, pages 1894–1938. PMLR,

work page 1938
[5]

Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity

Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richtárik, Samuel Horváth, and Martin Takáˇc. Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity. arXiv preprint arXiv:2409.14989,

work page arXiv
[6]

Less Regret via Online Conditioning

Matthew Streeter and H Brendan McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Jiang, Z

Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, and Aryan Mokhtari. Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization.arXiv preprint arXiv:2602.08232,

work page arXiv
[8]

Optimal Projection-Free Adaptive SGD for Matrix Optimization

Dmitry Kovalev. Optimal projection-free adaptive sgd for matrix optimization. arXiv preprint arXiv:2604.02505,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

A simple convergence proof of adam and adagrad.arXiv preprint arXiv:2003.02395, 2020

Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. arXiv preprint arXiv:2003.02395,

work page arXiv 2003
[10]

Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed

Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáˇc, and Eduard Gorbunov. Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed. arXiv preprint arXiv:2406.04443,

work page arXiv
[11]

Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise

Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, and Yan Dai. Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise. arXiv preprint arXiv:2402.01567,

work page arXiv
[12]

Random scaling and momentum for non-smooth non-convex optimization

Qinzi Zhang and Ashok Cutkosky. Random scaling and momentum for non-smooth non-convex optimization. arXiv preprint arXiv:2405.09742,

work page arXiv
[13]

SGD Converges to Global Minimum in Deep Learning via Star-convex Path

Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451,

work page internal anchor Pith review Pith/arXiv arXiv 1901
[14]

Reevaluating theoretical analysis methods for optimization in deep learning

Hoang Tran, Qinzi Zhang, and Ashok Cutkosky. Reevaluating theoretical analysis methods for optimization in deep learning. arXiv preprint arXiv:2407.01825,

work page arXiv
[15]

Convergence of adam under relaxed assumptions

Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36:52166–52196, 2023a. Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems...

work page arXiv
[16]

Kovalev and E

Dmitry Kovalev and Ekaterina Borodich. Non-euclidean sgd for structured optimization: Unified analysis and improved rates. arXiv preprint arXiv:2511.11466,

work page arXiv
[17]

Sgd with adaptive preconditioning: Unified analysis and momentum acceleration

Dmitry Kovalev. Sgd with adaptive preconditioning: Unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803,

work page arXiv
[18]

Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd

for more details on quasar-convex functions. Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd. However, we can define the generalized subdifferential ∂◦f(x) = ∂f ◦(x; · ). Here, f ◦(x; w) is the generalized directional derivative [Clarke, 1990], which is a convex function with respect to the second argument and is d...

work page 1990
[19]

We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev

or low-rank gradients (Algorithm 4). We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev

work page 2025
[20]

Hence, by the Monotonicity Theorem [Hagood and Thomson, 2006], we conclude that the function r(t) is non-increasing, which implies r(t) ≤ 0 for all t ∈ R+. Next, since the function p(t) is continuous, the derivative ˙q(t) of the function q(t) exists and can be upper-bounded as follows: ˙q(t) (a) = (M0 + M1p(t))∥x′ − x∥2 ≤ (M0 + M1[p(0) + |p(t) − p(0)|])∥x...

work page 2006
[21]

(15); (d) uses the definition in eq

+ 2ηG2 0 (c) ≤ O h RG0√ K + 1 + [RG1]2F K + 1 i (d) ≤ O[ϵ], where (a) uses the convexity of the function f(x) and the definition of xK; (b) uses the inequality above and the fact that ∥x∗∥2 ≤ R and x0 = 0; (c) uses the definitions in eq. (15); (d) uses the definition in eq. (15), Remark 1, and the inequality in eq. (7). E.4 Proof of Theorem 4 Using the de...

work page 2010
[22]

Let p ∈ (0, 1/2], and let {∆k}k∈N ⊂ R+ and {δk}k∈N ⊂ R+ be two sequences of non-negative numbers satisfying the following condition for all k ∈ N: πk∆k ≤ δk + pγPk i=1αiπi∆i, (55) where αk, πk satisfy eq. (47). Then, the following inequalities hold for all k ∈ N: Pk i=1αiπi∆i ≤Pk i=1αiδi[πk/πi−1]p. (56) Proof. We define τk ≥ 0 for k ∈ N as follows: τk =Pk...

work page 2020

[1] [1]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

arXiv preprint arXiv:1905.11881 , year=

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881,

work page arXiv 1905

[4] [4]

Near-optimal methods for minimizing star-convex functions and beyond

Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. In Conference on learning theory, pages 1894–1938. PMLR,

work page 1938

[5] [5]

Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity

Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richtárik, Samuel Horváth, and Martin Takáˇc. Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity. arXiv preprint arXiv:2409.14989,

work page arXiv

[6] [6]

Less Regret via Online Conditioning

Matthew Streeter and H Brendan McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Jiang, Z

Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, and Aryan Mokhtari. Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization.arXiv preprint arXiv:2602.08232,

work page arXiv

[8] [8]

Optimal Projection-Free Adaptive SGD for Matrix Optimization

Dmitry Kovalev. Optimal projection-free adaptive sgd for matrix optimization. arXiv preprint arXiv:2604.02505,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

A simple convergence proof of adam and adagrad.arXiv preprint arXiv:2003.02395, 2020

Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. arXiv preprint arXiv:2003.02395,

work page arXiv 2003

[10] [10]

Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed

Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáˇc, and Eduard Gorbunov. Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed. arXiv preprint arXiv:2406.04443,

work page arXiv

[11] [11]

Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise

Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, and Yan Dai. Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise. arXiv preprint arXiv:2402.01567,

work page arXiv

[12] [12]

Random scaling and momentum for non-smooth non-convex optimization

Qinzi Zhang and Ashok Cutkosky. Random scaling and momentum for non-smooth non-convex optimization. arXiv preprint arXiv:2405.09742,

work page arXiv

[13] [13]

SGD Converges to Global Minimum in Deep Learning via Star-convex Path

Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451,

work page internal anchor Pith review Pith/arXiv arXiv 1901

[14] [14]

Reevaluating theoretical analysis methods for optimization in deep learning

Hoang Tran, Qinzi Zhang, and Ashok Cutkosky. Reevaluating theoretical analysis methods for optimization in deep learning. arXiv preprint arXiv:2407.01825,

work page arXiv

[15] [15]

Convergence of adam under relaxed assumptions

Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36:52166–52196, 2023a. Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems...

work page arXiv

[16] [16]

Kovalev and E

Dmitry Kovalev and Ekaterina Borodich. Non-euclidean sgd for structured optimization: Unified analysis and improved rates. arXiv preprint arXiv:2511.11466,

work page arXiv

[17] [17]

Sgd with adaptive preconditioning: Unified analysis and momentum acceleration

Dmitry Kovalev. Sgd with adaptive preconditioning: Unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803,

work page arXiv

[18] [18]

Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd

for more details on quasar-convex functions. Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd. However, we can define the generalized subdifferential ∂◦f(x) = ∂f ◦(x; · ). Here, f ◦(x; w) is the generalized directional derivative [Clarke, 1990], which is a convex function with respect to the second argument and is d...

work page 1990

[19] [19]

We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev

or low-rank gradients (Algorithm 4). We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev

work page 2025

[20] [20]

Hence, by the Monotonicity Theorem [Hagood and Thomson, 2006], we conclude that the function r(t) is non-increasing, which implies r(t) ≤ 0 for all t ∈ R+. Next, since the function p(t) is continuous, the derivative ˙q(t) of the function q(t) exists and can be upper-bounded as follows: ˙q(t) (a) = (M0 + M1p(t))∥x′ − x∥2 ≤ (M0 + M1[p(0) + |p(t) − p(0)|])∥x...

work page 2006

[21] [21]

(15); (d) uses the definition in eq

+ 2ηG2 0 (c) ≤ O h RG0√ K + 1 + [RG1]2F K + 1 i (d) ≤ O[ϵ], where (a) uses the convexity of the function f(x) and the definition of xK; (b) uses the inequality above and the fact that ∥x∗∥2 ≤ R and x0 = 0; (c) uses the definitions in eq. (15); (d) uses the definition in eq. (15), Remark 1, and the inequality in eq. (7). E.4 Proof of Theorem 4 Using the de...

work page 2010

[22] [22]

Let p ∈ (0, 1/2], and let {∆k}k∈N ⊂ R+ and {δk}k∈N ⊂ R+ be two sequences of non-negative numbers satisfying the following condition for all k ∈ N: πk∆k ≤ δk + pγPk i=1αiπi∆i, (55) where αk, πk satisfy eq. (47). Then, the following inequalities hold for all k ∈ N: Pk i=1αiπi∆i ≤Pk i=1αiδi[πk/πi−1]p. (56) Proof. We define τk ≥ 0 for k ∈ N as follows: τk =Pk...

work page 2020