AdaGrad does not adapt to H\"older-smoothness for composite objectives
Pith reviewed 2026-06-30 05:35 UTC · model grok-4.3
The pith
AdaGrad fails to achieve the expected convergence rate for Hölder-smooth composite problems because its gradient accumulation keeps shrinking stepsizes even when the smooth term's gradient stays nonzero at the solution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We exhibit a simple deterministic one-dimensional convex composite optimization problem for which AdaGrad scheme does not achieve the classical convergence rate O(n^{-(1+ν)/2}) associated with Hölder-smooth objectives. The example highlights a basic mismatch between classical AdaGrad accumulation and composite optimality. A main insight is that the gradient of the smooth term may not vanish at the optimum, causing AdaGrad to keep reducing its stepsize excessively and converge more slowly.
What carries the argument
AdaGrad's accumulation of squared gradient norms, which drives stepsize reduction independently of whether the smooth gradient vanishes at the composite minimizer.
If this is right
- AdaGrad converges more slowly than the classical Hölder rate on the constructed composite problem.
- Accumulation based on gradient mappings avoids the excessive stepsize reduction.
- Accumulation based on successive gradient differences also avoids the pathology.
Where Pith is reading between the lines
- The same accumulation mismatch could appear in higher-dimensional composite problems with non-vanishing smooth gradients.
- Other diagonal adaptive methods that rely on similar squared-gradient accumulations may inherit the same limitation.
- Composite-aware stepsize rules might need to incorporate information from the nonsmooth term explicitly.
Load-bearing premise
The counterexample is a convex composite problem in which the gradient of the smooth term remains nonzero at the minimizer.
What would settle it
Run the AdaGrad update on the paper's explicit one-dimensional example and measure whether the distance to the minimizer decays at rate O(n^{-(1+ν)/2}) or slower.
Figures
read the original abstract
We exhibit a simple deterministic one-dimensional convex composite optimization problem for which AdaGrad scheme does not achieve the classical convergence rate $\mathcal{O}(n^{-(1+\nu)/2})$ associated with H\"older-smooth objectives. The example highlights a basic mismatch between classical AdaGrad accumulation and composite optimality. A main insight is that the gradient of the smooth term may not vanish at the optimum, causing AdaGrad to keep reducing its stepsize excessively and converge more slowly. We also discuss why alternative accumulation mechanisms based on gradient mappings or on successive gradient differences, avoid this pathology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs a simple deterministic one-dimensional convex composite optimization problem to demonstrate that the standard AdaGrad algorithm fails to achieve the expected convergence rate O(n^{-(1+ν)/2}) for Hölder-smooth objectives with exponent ν. The central reason is that the gradient of the smooth term does not vanish at the minimizer, leading to continued excessive step-size shrinkage; the manuscript also discusses why gradient-mapping or successive-difference accumulations avoid this issue.
Significance. If the explicit construction holds, the result is significant for clarifying a basic mismatch between AdaGrad's step-size accumulation and composite optimality conditions under Hölder smoothness. The paper receives credit for supplying a deterministic, parameter-free counterexample together with a clear mechanistic explanation (non-vanishing smooth gradient) and for contrasting it with alternative accumulation rules that restore the expected rate.
minor comments (1)
- The abstract could briefly name the concrete functions used in the 1-D example to let readers immediately check the Hölder condition and the non-vanishing-gradient property.
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript and for recommending acceptance. The referee's description of the contribution is accurate.
Circularity Check
No significant circularity detected
full rationale
The paper constructs an explicit deterministic 1D convex composite optimization problem as a counterexample showing that standard AdaGrad fails to attain the O(n^{-(1+ν)/2}) rate for Hölder-smooth objectives when the smooth-term gradient does not vanish at the minimizer. This is a direct existence proof via problem definition and rate calculation, with no fitted parameters renamed as predictions, no self-citation chains, no ansatz smuggling, and no self-definitional reductions. The claim is self-contained against the external benchmark rate and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math The objective is convex and the smooth part has Hölder-continuous gradient with exponent ν
- domain assumption The gradient of the smooth term need not vanish at the composite minimizer
Reference graph
Works this paper leans on
-
[1]
Bojovic, Matia and Salzo, Saverio and Pontil, Massimiliano , journal=. Ada
-
[2]
Journal of Machine Learning Research , year =
John Duchi and Elad Hazan and Yoram Singer , title =. Journal of Machine Learning Research , year =
-
[3]
Universal
Wang, Zimeng and Yurtsever, Alp , journal=. Universal
-
[4]
Journal of Machine Learning Research , volume=
Adagrad stepsizes: Sharp convergence over nonconvex landscapes , author=. Journal of Machine Learning Research , volume=
-
[5]
Conference on Learning Theory , pages=
The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance , author=. Conference on Learning Theory , pages=. 2022 , organization=
2022
-
[6]
arXiv preprint arXiv:2308.05621 , year=
Normalized gradients for all , author=. arXiv preprint arXiv:2308.05621 , year=
-
[7]
Advances in neural information processing systems , volume=
Online adaptive methods, universality and acceleration , author=. Advances in neural information processing systems , volume=
-
[8]
ACM/IMS Journal of Data Science , volume=
Adaptive bilevel optimization , author=. ACM/IMS Journal of Data Science , volume=. 2025 , publisher=
2025
-
[9]
Mathematical Programming , volume=
Universal gradient methods for convex optimization problems , author=. Mathematical Programming , volume=. 2015 , publisher=
2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.