Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
Pith reviewed 2026-05-19 15:11 UTC · model grok-4.3
The pith
AdamW with clipped updates outperforms SGD and AdaGrad on convex stochastic problems with unbounded gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For convex stochastic generalized Lipschitz optimization problems, AdamW with clipped updates achieves the best global convergence rates among popular stochastic optimization methods such as SGD and AdaGrad; the exponentially weighted gradient accumulation of AdamW is essential to these rates, and the same clipped procedure also yields improved rates under generalized smoothness while extending to quasar-convex and preconditioned settings.
What carries the argument
The generalized Lipschitz condition that bounds gradient norm by an affine function of the optimality gap, together with the clipped AdamW update that combines exponential moving averages and gradient clipping.
If this is right
- Clipped AdamW remains competitive and achieves improved rates when the objective also satisfies the popular generalized smoothness assumption.
- The exponentially weighted accumulation inside AdamW, rather than simple averaging, is necessary for the superior rates.
- The same clipped AdamW analysis carries over to versions that use diagonal or matrix preconditioners.
- The convergence guarantees extend from convex to quasar-convex objectives.
Where Pith is reading between the lines
- The framework suggests testing whether clipping plus momentum improves empirical performance on non-smooth deep-learning tasks whose gradients grow with parameter distance from a good solution.
- Similar rate comparisons could be carried out for other adaptive methods such as RMSProp or Lion under the same generalized Lipschitz assumption.
- The affine bound on gradients might be relaxed further to sublinear or logarithmic growth while preserving the advantage of clipped AdamW.
Load-bearing premise
The objective function satisfies that its gradient norm is bounded by a linear function of the current distance to optimality.
What would settle it
A convex function belonging to the generalized Lipschitz class on which SGD with tuned stepsizes converges at a strictly better rate than clipped AdamW.
read the original abstract
Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates, theoretically outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the class of generalized Lipschitz functions, in which the gradient norm is bounded by an affine function of the optimality gap. It develops convergence analyses for several stochastic first-order methods under this assumption for convex problems and claims that AdamW with clipped updates achieves strictly superior global rates to SGD and AdaGrad. The analysis emphasizes the role of exponentially weighted gradient accumulation rather than simple averaging. Extensions are given to the generalized smoothness setting, diagonal and matrix preconditioners, and the quasar-convex case.
Significance. If the central rates hold under the stated assumptions, the work would meaningfully relax the uniform gradient bound that is common but often unrealistic in non-smooth stochastic optimization. The explicit comparison of clipped AdamW against SGD and AdaGrad, together with the identification of exponential averaging as critical, could influence both theory and practical algorithm choice. No machine-checked proofs or reproducible code are mentioned, but the rates are in principle falsifiable.
major comments (2)
- [Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.
- [Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.
minor comments (2)
- [Abstract] The abstract states that AdamW 'theoretically outperforms' other methods; the precise rates (e.g., dependence on T, constants, and noise parameters) should be stated explicitly so readers can verify the improvement.
- [Definition of generalized Lipschitz class] Clarify whether the generalized Lipschitz condition is assumed to hold with the same constants a, b for all iterates or only in expectation; this affects how the analysis handles early iterates far from the optimum.
Simulated Author's Rebuttal
We thank the referee for the constructive report and the opportunity to clarify our contributions. We address each major comment below and have revised the manuscript to strengthen the presentation of assumptions and comparisons.
read point-by-point responses
-
Referee: [Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.
Authors: We thank the referee for highlighting this important distinction. The generalized Lipschitz condition bounds only the true gradient; our analysis of clipped AdamW (and the comparison to SGD) additionally invokes a standard uniform bound on the variance of the stochastic gradients that is independent of the optimality gap. This assumption is present in the original submission but was not sufficiently foregrounded. In the revision we have added an explicit statement (Assumption 2.3) and a new remark (Remark 3.2) that isolates the role of bounded variance. We also include a short counter-example showing that if variance were allowed to grow quadratically with the gradient norm, the exponential-averaging advantage of AdamW would indeed disappear and the rates would collapse to those of clipped SGD. Thus the bounded-variance assumption is necessary for the claimed strict improvement. revision: yes
-
Referee: [Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.
Authors: We agree that identical noise assumptions are required for the comparison to be meaningful. All three algorithms were originally analyzed under the same pair of assumptions (generalized Lipschitz plus uniformly bounded stochastic-gradient variance). To make the parallelism transparent, we have re-derived the SGD and AdaGrad rates in a single unified theorem (Theorem 3.1) that uses exactly the same noise model and proof template as the AdamW result (Theorem 3.4). The rates are now displayed side-by-side in Table 1, which also isolates the improvement attributable to exponential averaging. This revision removes any ambiguity about differing noise models. revision: yes
Circularity Check
No circularity: new function class analyzed with standard techniques
full rationale
The paper defines a new generalized Lipschitz class (gradient norm bounded affinely by optimality gap) and performs fresh convergence analysis on existing algorithms including clipped AdamW. No step reduces a claimed rate or outperformance result to a fitted parameter, self-referential definition, or load-bearing self-citation; the central comparison follows from applying standard stochastic analysis tools to the new assumption class. The derivation remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The objective function is convex.
- domain assumption Stochastic gradients are unbiased estimates of the true gradient.
invented entities (1)
-
generalized Lipschitz functions
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
arXiv preprint arXiv:1905.11881 , year=
Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881,
-
[4]
Near-optimal methods for minimizing star-convex functions and beyond
Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. In Conference on learning theory, pages 1894–1938. PMLR,
work page 1938
-
[5]
Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity
Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richtárik, Samuel Horváth, and Martin Takáˇc. Methods for convex (l_0, l_1)-smooth optimization: Clipping, acceleration, and adaptivity. arXiv preprint arXiv:2409.14989,
-
[6]
Less Regret via Online Conditioning
Matthew Streeter and H Brendan McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862,
work page internal anchor Pith review Pith/arXiv arXiv
- [7]
-
[8]
Optimal Projection-Free Adaptive SGD for Matrix Optimization
Dmitry Kovalev. Optimal projection-free adaptive sgd for matrix optimization. arXiv preprint arXiv:2604.02505,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
A simple convergence proof of adam and adagrad.arXiv preprint arXiv:2003.02395, 2020
Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. arXiv preprint arXiv:2003.02395,
-
[10]
Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed
Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáˇc, and Eduard Gorbunov. Clipping improves adam-norm and adagrad- norm when the noise is heavy-tailed. arXiv preprint arXiv:2406.04443,
-
[11]
Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise
Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, and Yan Dai. Understanding adam optimizer via online learning of updates: Adam is ftrl in disguise. arXiv preprint arXiv:2402.01567,
-
[12]
Random scaling and momentum for non-smooth non-convex optimization
Qinzi Zhang and Ashok Cutkosky. Random scaling and momentum for non-smooth non-convex optimization. arXiv preprint arXiv:2405.09742,
-
[13]
SGD Converges to Global Minimum in Deep Learning via Star-convex Path
Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451,
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[14]
Reevaluating theoretical analysis methods for optimization in deep learning
Hoang Tran, Qinzi Zhang, and Ashok Cutkosky. Reevaluating theoretical analysis methods for optimization in deep learning. arXiv preprint arXiv:2407.01825,
-
[15]
Convergence of adam under relaxed assumptions
Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36:52166–52196, 2023a. Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems...
-
[16]
Dmitry Kovalev and Ekaterina Borodich. Non-euclidean sgd for structured optimization: Unified analysis and improved rates. arXiv preprint arXiv:2511.11466,
-
[17]
Sgd with adaptive preconditioning: Unified analysis and momentum acceleration
Dmitry Kovalev. Sgd with adaptive preconditioning: Unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803,
-
[18]
Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd
for more details on quasar-convex functions. Consequently, the function f(x) may no longer be subdifferentiable at every point x ∈ Rd. However, we can define the generalized subdifferential ∂◦f(x) = ∂f ◦(x; · ). Here, f ◦(x; w) is the generalized directional derivative [Clarke, 1990], which is a convex function with respect to the second argument and is d...
work page 1990
-
[19]
We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev
or low-rank gradients (Algorithm 4). We will provide a unified analysis based on the framework of Kovalev and Borodich [2025], Kovalev
work page 2025
-
[20]
Hence, by the Monotonicity Theorem [Hagood and Thomson, 2006], we conclude that the function r(t) is non-increasing, which implies r(t) ≤ 0 for all t ∈ R+. Next, since the function p(t) is continuous, the derivative ˙q(t) of the function q(t) exists and can be upper-bounded as follows: ˙q(t) (a) = (M0 + M1p(t))∥x′ − x∥2 ≤ (M0 + M1[p(0) + |p(t) − p(0)|])∥x...
work page 2006
-
[21]
(15); (d) uses the definition in eq
+ 2ηG2 0 (c) ≤ O h RG0√ K + 1 + [RG1]2F K + 1 i (d) ≤ O[ϵ], where (a) uses the convexity of the function f(x) and the definition of xK; (b) uses the inequality above and the fact that ∥x∗∥2 ≤ R and x0 = 0; (c) uses the definitions in eq. (15); (d) uses the definition in eq. (15), Remark 1, and the inequality in eq. (7). E.4 Proof of Theorem 4 Using the de...
work page 2010
-
[22]
Let p ∈ (0, 1/2], and let {∆k}k∈N ⊂ R+ and {δk}k∈N ⊂ R+ be two sequences of non-negative numbers satisfying the following condition for all k ∈ N: πk∆k ≤ δk + pγPk i=1αiπi∆i, (55) where αk, πk satisfy eq. (47). Then, the following inequalities hold for all k ∈ N: Pk i=1αiπi∆i ≤Pk i=1αiδi[πk/πi−1]p. (56) Proof. We define τk ≥ 0 for k ∈ N as follows: τk =Pk...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.