pith. sign in

arxiv: 2605.15522 · v1 · pith:J5AGE4GInew · submitted 2026-05-15 · 🧮 math.OC · cs.LG

Stochastic Non-Smooth Convex Optimization with Unbounded Gradients

Pith reviewed 2026-05-19 15:11 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords stochastic optimizationconvex optimizationAdamWgradient clippinggeneralized Lipschitzunbounded gradientsconvergence ratesquasar-convex
0
0 comments X

The pith

AdamW with clipped updates outperforms SGD and AdaGrad on convex stochastic problems with unbounded gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generalized Lipschitz class of functions in which gradient norms grow at most linearly with the current optimality gap, replacing the common but restrictive assumption of uniformly bounded gradients. It then compares several first-order stochastic methods under this assumption and establishes faster global convergence for AdamW when its updates are clipped. The analysis also isolates the benefit of AdamW's exponentially weighted gradient accumulation over plain averaging, and shows that the same clipped variant remains competitive under generalized smoothness and quasar-convexity.

Core claim

For convex stochastic generalized Lipschitz optimization problems, AdamW with clipped updates achieves the best global convergence rates among popular stochastic optimization methods such as SGD and AdaGrad; the exponentially weighted gradient accumulation of AdamW is essential to these rates, and the same clipped procedure also yields improved rates under generalized smoothness while extending to quasar-convex and preconditioned settings.

What carries the argument

The generalized Lipschitz condition that bounds gradient norm by an affine function of the optimality gap, together with the clipped AdamW update that combines exponential moving averages and gradient clipping.

If this is right

  • Clipped AdamW remains competitive and achieves improved rates when the objective also satisfies the popular generalized smoothness assumption.
  • The exponentially weighted accumulation inside AdamW, rather than simple averaging, is necessary for the superior rates.
  • The same clipped AdamW analysis carries over to versions that use diagonal or matrix preconditioners.
  • The convergence guarantees extend from convex to quasar-convex objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework suggests testing whether clipping plus momentum improves empirical performance on non-smooth deep-learning tasks whose gradients grow with parameter distance from a good solution.
  • Similar rate comparisons could be carried out for other adaptive methods such as RMSProp or Lion under the same generalized Lipschitz assumption.
  • The affine bound on gradients might be relaxed further to sublinear or logarithmic growth while preserving the advantage of clipped AdamW.

Load-bearing premise

The objective function satisfies that its gradient norm is bounded by a linear function of the current distance to optimality.

What would settle it

A convex function belonging to the generalized Lipschitz class on which SGD with tuned stepsizes converges at a strictly better rate than clipped AdamW.

read the original abstract

Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates, theoretically outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the class of generalized Lipschitz functions, in which the gradient norm is bounded by an affine function of the optimality gap. It develops convergence analyses for several stochastic first-order methods under this assumption for convex problems and claims that AdamW with clipped updates achieves strictly superior global rates to SGD and AdaGrad. The analysis emphasizes the role of exponentially weighted gradient accumulation rather than simple averaging. Extensions are given to the generalized smoothness setting, diagonal and matrix preconditioners, and the quasar-convex case.

Significance. If the central rates hold under the stated assumptions, the work would meaningfully relax the uniform gradient bound that is common but often unrealistic in non-smooth stochastic optimization. The explicit comparison of clipped AdamW against SGD and AdaGrad, together with the identification of exponential averaging as critical, could influence both theory and practical algorithm choice. No machine-checked proofs or reproducible code are mentioned, but the rates are in principle falsifiable.

major comments (2)
  1. [Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.
  2. [Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.
minor comments (2)
  1. [Abstract] The abstract states that AdamW 'theoretically outperforms' other methods; the precise rates (e.g., dependence on T, constants, and noise parameters) should be stated explicitly so readers can verify the improvement.
  2. [Definition of generalized Lipschitz class] Clarify whether the generalized Lipschitz condition is assumed to hold with the same constants a, b for all iterates or only in expectation; this affects how the analysis handles early iterates far from the optimum.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and the opportunity to clarify our contributions. We address each major comment below and have revised the manuscript to strengthen the presentation of assumptions and comparisons.

read point-by-point responses
  1. Referee: [Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.

    Authors: We thank the referee for highlighting this important distinction. The generalized Lipschitz condition bounds only the true gradient; our analysis of clipped AdamW (and the comparison to SGD) additionally invokes a standard uniform bound on the variance of the stochastic gradients that is independent of the optimality gap. This assumption is present in the original submission but was not sufficiently foregrounded. In the revision we have added an explicit statement (Assumption 2.3) and a new remark (Remark 3.2) that isolates the role of bounded variance. We also include a short counter-example showing that if variance were allowed to grow quadratically with the gradient norm, the exponential-averaging advantage of AdamW would indeed disappear and the rates would collapse to those of clipped SGD. Thus the bounded-variance assumption is necessary for the claimed strict improvement. revision: yes

  2. Referee: [Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.

    Authors: We agree that identical noise assumptions are required for the comparison to be meaningful. All three algorithms were originally analyzed under the same pair of assumptions (generalized Lipschitz plus uniformly bounded stochastic-gradient variance). To make the parallelism transparent, we have re-derived the SGD and AdaGrad rates in a single unified theorem (Theorem 3.1) that uses exactly the same noise model and proof template as the AdamW result (Theorem 3.4). The rates are now displayed side-by-side in Table 1, which also isolates the improvement attributable to exponential averaging. This revision removes any ambiguity about differing noise models. revision: yes

Circularity Check

0 steps flagged

No circularity: new function class analyzed with standard techniques

full rationale

The paper defines a new generalized Lipschitz class (gradient norm bounded affinely by optimality gap) and performs fresh convergence analysis on existing algorithms including clipped AdamW. No step reduces a claimed rate or outperformance result to a fitted parameter, self-referential definition, or load-bearing self-citation; the central comparison follows from applying standard stochastic analysis tools to the new assumption class. The derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Central claims rest on the newly defined generalized Lipschitz property plus standard background assumptions of convexity and unbiased stochastic gradients.

axioms (2)
  • domain assumption The objective function is convex.
    Standard assumption invoked for the optimization setting.
  • domain assumption Stochastic gradients are unbiased estimates of the true gradient.
    Common modeling assumption in stochastic first-order methods.
invented entities (1)
  • generalized Lipschitz functions no independent evidence
    purpose: Model non-smooth convex functions whose gradient norms grow affinely with optimality gap
    Newly introduced class that replaces uniform bounded-gradient assumption.

pith-pipeline@v0.9.0 · 5683 in / 1230 out tokens · 75511 ms · 2026-05-19T15:11:36.275720+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 5 internal anchors

  1. [1]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  2. [2]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

  3. [3]

    arXiv preprint arXiv:1905.11881 , year=

    Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881,

  4. [4]

    Near-optimal methods for minimizing star-convex functions and beyond

    Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. In Conference on learning theory, pages 1894–1938. PMLR,

  5. [5]

    Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity

    Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richtárik, Samuel Horváth, and Martin Takáˇc. Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity. arXiv preprint arXiv:2409.14989,

  6. [6]

    Less Regret via Online Conditioning

    Matthew Streeter and H Brendan McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862,

  7. [7]

    Jiang, Z

    Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, and Aryan Mokhtari. Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization.arXiv preprint arXiv:2602.08232,

  8. [8]

    Optimal Projection-Free Adaptive SGD for Matrix Optimization

    Dmitry Kovalev. Optimal projection-free adaptive sgd for matrix optimization. arXiv preprint arXiv:2604.02505,

  9. [9]

    A simple convergence proof of adam and adagrad.arXiv preprint arXiv:2003.02395, 2020

    Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. arXiv preprint arXiv:2003.02395,

  10. [10]

    Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed

    Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáˇc, and Eduard Gorbunov. Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed. arXiv preprint arXiv:2406.04443,

  11. [11]

    Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise

    Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, and Yan Dai. Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise. arXiv preprint arXiv:2402.01567,

  12. [12]

    Random scaling and momentum for non-smooth non-convex optimization

    Qinzi Zhang and Ashok Cutkosky. Random scaling and momentum for non-smooth non-convex optimization. arXiv preprint arXiv:2405.09742,

  13. [13]

    SGD Converges to Global Minimum in Deep Learning via Star-convex Path

    Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451,

  14. [14]

    Reevaluating theoretical analysis methods for optimization in deep learning

    Hoang Tran, Qinzi Zhang, and Ashok Cutkosky. Reevaluating theoretical analysis methods for optimization in deep learning. arXiv preprint arXiv:2407.01825,

  15. [15]

    Convergence of adam under relaxed assumptions

    Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36:52166–52196, 2023a. Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems...

  16. [16]

    Kovalev and E

    Dmitry Kovalev and Ekaterina Borodich. Non-euclidean sgd for structured optimization: Unified analysis and improved rates. arXiv preprint arXiv:2511.11466,

  17. [17]

    Sgd with adaptive preconditioning: Unified analysis and momentum acceleration

    Dmitry Kovalev. Sgd with adaptive preconditioning: Unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803,

  18. [18]

    Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd

    for more details on quasar-convex functions. Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd. However, we can define the generalized subdifferential ∂◦f(x) = ∂f ◦(x; · ). Here, f ◦(x; w) is the generalized directional derivative [Clarke, 1990], which is a convex function with respect to the second argument and is d...

  19. [19]

    We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev

    or low-rank gradients (Algorithm 4). We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev

  20. [20]

    Hence, by the Monotonicity Theorem [Hagood and Thomson, 2006], we conclude that the function r(t) is non-increasing, which implies r(t) ≤ 0 for all t ∈ R+. Next, since the function p(t) is continuous, the derivative ˙q(t) of the function q(t) exists and can be upper-bounded as follows: ˙q(t) (a) = (M0 + M1p(t))∥x′ − x∥2 ≤ (M0 + M1[p(0) + |p(t) − p(0)|])∥x...

  21. [21]

    (15); (d) uses the definition in eq

    + 2ηG2 0 (c) ≤ O h RG0√ K + 1 + [RG1]2F K + 1 i (d) ≤ O[ϵ], where (a) uses the convexity of the function f(x) and the definition of xK; (b) uses the inequality above and the fact that ∥x∗∥2 ≤ R and x0 = 0; (c) uses the definitions in eq. (15); (d) uses the definition in eq. (15), Remark 1, and the inequality in eq. (7). E.4 Proof of Theorem 4 Using the de...

  22. [22]

    Let p ∈ (0, 1/2], and let {∆k}k∈N ⊂ R+ and {δk}k∈N ⊂ R+ be two sequences of non-negative numbers satisfying the following condition for all k ∈ N: πk∆k ≤ δk + pγPk i=1αiπi∆i, (55) where αk, πk satisfy eq. (47). Then, the following inequalities hold for all k ∈ N: Pk i=1αiπi∆i ≤Pk i=1αiδi[πk/πi−1]p. (56) Proof. We define τk ≥ 0 for k ∈ N as follows: τk =Pk...