AdaGrad does not adapt to H\"older-smoothness for composite objectives

Massimiliano Pontil; Matia Bojovic; Saverio Salzo

arxiv: 2606.29893 · v1 · pith:LPXYKGRJnew · submitted 2026-06-29 · 🧮 math.OC · stat.ML

AdaGrad does not adapt to H\"older-smoothness for composite objectives

Matia Bojovic , Saverio Salzo , Massimiliano Pontil This is my paper

Pith reviewed 2026-06-30 05:35 UTC · model grok-4.3

classification 🧮 math.OC stat.ML

keywords AdaGradHölder smoothnesscomposite optimizationconvergence ratesadaptive methodsconvex optimizationstepsize adaptation

0 comments

The pith

AdaGrad fails to achieve the expected convergence rate for Hölder-smooth composite problems because its gradient accumulation keeps shrinking stepsizes even when the smooth term's gradient stays nonzero at the solution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a simple one-dimensional convex composite optimization problem and shows that standard AdaGrad does not attain the classical rate O(n^{-(1+ν)/2}) that holds for Hölder-smooth objectives. The mismatch arises because the gradient of the smooth part does not vanish at the minimizer, so the accumulated squared gradients continue to grow and force ever-smaller steps. The authors contrast this with alternative accumulations based on gradient mappings or successive differences, which avoid the excessive shrinkage. A sympathetic reader sees the result as evidence that classical AdaGrad accumulation is not automatically compatible with composite optimality conditions.

Core claim

We exhibit a simple deterministic one-dimensional convex composite optimization problem for which AdaGrad scheme does not achieve the classical convergence rate O(n^{-(1+ν)/2}) associated with Hölder-smooth objectives. The example highlights a basic mismatch between classical AdaGrad accumulation and composite optimality. A main insight is that the gradient of the smooth term may not vanish at the optimum, causing AdaGrad to keep reducing its stepsize excessively and converge more slowly.

What carries the argument

AdaGrad's accumulation of squared gradient norms, which drives stepsize reduction independently of whether the smooth gradient vanishes at the composite minimizer.

If this is right

AdaGrad converges more slowly than the classical Hölder rate on the constructed composite problem.
Accumulation based on gradient mappings avoids the excessive stepsize reduction.
Accumulation based on successive gradient differences also avoids the pathology.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same accumulation mismatch could appear in higher-dimensional composite problems with non-vanishing smooth gradients.
Other diagonal adaptive methods that rely on similar squared-gradient accumulations may inherit the same limitation.
Composite-aware stepsize rules might need to incorporate information from the nonsmooth term explicitly.

Load-bearing premise

The counterexample is a convex composite problem in which the gradient of the smooth term remains nonzero at the minimizer.

What would settle it

Run the AdaGrad update on the paper's explicit one-dimensional example and measure whether the distance to the minimizer decays at rate O(n^{-(1+ν)/2}) or slower.

Figures

Figures reproduced from arXiv: 2606.29893 by Massimiliano Pontil, Matia Bojovic, Saverio Salzo.

**Figure 1.** Figure 1: Left: f(x) = gp,c(x) − x. Right: f ′ (x) = g ′ p,c(x) − 1, for p = 4, 6, 8 and c = 0.8. Related works. The issue of the accumulation mechanism of the gradients in AdaGrad has already been considered in the literature. Indeed, there exist works proposing different accumulation schemes. Antonakopoulos et al. [2025], Wang and Yurtsever [2026] analyze an AdaGrad-type algorithm, which accumulates gradient mappi… view at source ↗

read the original abstract

We exhibit a simple deterministic one-dimensional convex composite optimization problem for which AdaGrad scheme does not achieve the classical convergence rate $\mathcal{O}(n^{-(1+\nu)/2})$ associated with H\"older-smooth objectives. The example highlights a basic mismatch between classical AdaGrad accumulation and composite optimality. A main insight is that the gradient of the smooth term may not vanish at the optimum, causing AdaGrad to keep reducing its stepsize excessively and converge more slowly. We also discuss why alternative accumulation mechanisms based on gradient mappings or on successive gradient differences, avoid this pathology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean 1D counterexample where AdaGrad misses the expected Hölder rate on composite problems because the smooth gradient stays nonzero at the minimizer.

read the letter

The core takeaway is that standard AdaGrad accumulation fails to deliver the O(n^{-(1+ν)/2}) rate on a simple convex composite problem in one dimension. The smooth term's gradient does not go to zero at the solution, so the sum of squared gradients keeps increasing and the effective stepsize shrinks too aggressively.

The construction is new in its explicit focus on the composite case and the non-vanishing gradient property. Prior analyses often assumed vanishing gradients or treated the smooth and nonsmooth parts differently. The paper does a good job keeping the example deterministic and low-dimensional, which makes the mismatch transparent. The discussion of gradient-mapping or difference-based alternatives is also useful; it shows a practical fix without overclaiming.

The main soft spot is that the abstract only states existence, so the actual functions, the verification that the Hölder condition holds with the claimed exponent, and the precise rate calculation need to be checked in the full text. If those details are explicit and correct, the claim holds; nothing in the high-level argument looks circular or fitted. No other obvious gaps appear.

This is for people working on adaptive first-order methods and their analysis for composite or Hölder-smooth objectives. A reader who wants to understand when accumulation rules break would get direct value from the counterexample.

I would send it to peer review. The negative result is concrete enough and the issue it raises matters for algorithm selection.

Referee Report

0 major / 1 minor

Summary. The paper constructs a simple deterministic one-dimensional convex composite optimization problem to demonstrate that the standard AdaGrad algorithm fails to achieve the expected convergence rate O(n^{-(1+ν)/2}) for Hölder-smooth objectives with exponent ν. The central reason is that the gradient of the smooth term does not vanish at the minimizer, leading to continued excessive step-size shrinkage; the manuscript also discusses why gradient-mapping or successive-difference accumulations avoid this issue.

Significance. If the explicit construction holds, the result is significant for clarifying a basic mismatch between AdaGrad's step-size accumulation and composite optimality conditions under Hölder smoothness. The paper receives credit for supplying a deterministic, parameter-free counterexample together with a clear mechanistic explanation (non-vanishing smooth gradient) and for contrasting it with alternative accumulation rules that restore the expected rate.

minor comments (1)

The abstract could briefly name the concrete functions used in the 1-D example to let readers immediately check the Hölder condition and the non-vanishing-gradient property.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript and for recommending acceptance. The referee's description of the contribution is accurate.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs an explicit deterministic 1D convex composite optimization problem as a counterexample showing that standard AdaGrad fails to attain the O(n^{-(1+ν)/2}) rate for Hölder-smooth objectives when the smooth-term gradient does not vanish at the minimizer. This is a direct existence proof via problem definition and rate calculation, with no fitted parameters renamed as predictions, no self-citation chains, no ansatz smuggling, and no self-definitional reductions. The claim is self-contained against the external benchmark rate and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on standard convex optimization and Hölder continuity assumptions plus the explicit construction of a 1D instance where the smooth gradient remains nonzero at optimality; no free parameters or invented entities are introduced.

axioms (2)

standard math The objective is convex and the smooth part has Hölder-continuous gradient with exponent ν
Invoked to invoke the classical rate O(n^{-(1+ν)/2}) as benchmark
domain assumption The gradient of the smooth term need not vanish at the composite minimizer
Central to the pathology described in the abstract

pith-pipeline@v0.9.1-grok · 5626 in / 1278 out tokens · 41604 ms · 2026-06-30T05:35:02.877673+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 1 canonical work pages

[1]

Bojovic, Matia and Salzo, Saverio and Pontil, Massimiliano , journal=. Ada
[2]

Journal of Machine Learning Research , year =

John Duchi and Elad Hazan and Yoram Singer , title =. Journal of Machine Learning Research , year =
[3]

Universal

Wang, Zimeng and Yurtsever, Alp , journal=. Universal
[4]

Journal of Machine Learning Research , volume=

Adagrad stepsizes: Sharp convergence over nonconvex landscapes , author=. Journal of Machine Learning Research , volume=
[5]

Conference on Learning Theory , pages=

The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance , author=. Conference on Learning Theory , pages=. 2022 , organization=

2022
[6]

arXiv preprint arXiv:2308.05621 , year=

Normalized gradients for all , author=. arXiv preprint arXiv:2308.05621 , year=

work page arXiv
[7]

Advances in neural information processing systems , volume=

Online adaptive methods, universality and acceleration , author=. Advances in neural information processing systems , volume=
[8]

ACM/IMS Journal of Data Science , volume=

Adaptive bilevel optimization , author=. ACM/IMS Journal of Data Science , volume=. 2025 , publisher=

2025
[9]

Mathematical Programming , volume=

Universal gradient methods for convex optimization problems , author=. Mathematical Programming , volume=. 2015 , publisher=

2015

[1] [1]

Bojovic, Matia and Salzo, Saverio and Pontil, Massimiliano , journal=. Ada

[2] [2]

Journal of Machine Learning Research , year =

John Duchi and Elad Hazan and Yoram Singer , title =. Journal of Machine Learning Research , year =

[3] [3]

Universal

Wang, Zimeng and Yurtsever, Alp , journal=. Universal

[4] [4]

Journal of Machine Learning Research , volume=

Adagrad stepsizes: Sharp convergence over nonconvex landscapes , author=. Journal of Machine Learning Research , volume=

[5] [5]

Conference on Learning Theory , pages=

The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance , author=. Conference on Learning Theory , pages=. 2022 , organization=

2022

[6] [6]

arXiv preprint arXiv:2308.05621 , year=

Normalized gradients for all , author=. arXiv preprint arXiv:2308.05621 , year=

work page arXiv

[7] [7]

Advances in neural information processing systems , volume=

Online adaptive methods, universality and acceleration , author=. Advances in neural information processing systems , volume=

[8] [8]

ACM/IMS Journal of Data Science , volume=

Adaptive bilevel optimization , author=. ACM/IMS Journal of Data Science , volume=. 2025 , publisher=

2025

[9] [9]

Mathematical Programming , volume=

Universal gradient methods for convex optimization problems , author=. Mathematical Programming , volume=. 2015 , publisher=

2015