pith. sign in

arxiv: 2606.29893 · v1 · pith:LPXYKGRJnew · submitted 2026-06-29 · 🧮 math.OC · stat.ML

AdaGrad does not adapt to H\"older-smoothness for composite objectives

Pith reviewed 2026-06-30 05:35 UTC · model grok-4.3

classification 🧮 math.OC stat.ML
keywords AdaGradHölder smoothnesscomposite optimizationconvergence ratesadaptive methodsconvex optimizationstepsize adaptation
0
0 comments X

The pith

AdaGrad fails to achieve the expected convergence rate for Hölder-smooth composite problems because its gradient accumulation keeps shrinking stepsizes even when the smooth term's gradient stays nonzero at the solution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a simple one-dimensional convex composite optimization problem and shows that standard AdaGrad does not attain the classical rate O(n^{-(1+ν)/2}) that holds for Hölder-smooth objectives. The mismatch arises because the gradient of the smooth part does not vanish at the minimizer, so the accumulated squared gradients continue to grow and force ever-smaller steps. The authors contrast this with alternative accumulations based on gradient mappings or successive differences, which avoid the excessive shrinkage. A sympathetic reader sees the result as evidence that classical AdaGrad accumulation is not automatically compatible with composite optimality conditions.

Core claim

We exhibit a simple deterministic one-dimensional convex composite optimization problem for which AdaGrad scheme does not achieve the classical convergence rate O(n^{-(1+ν)/2}) associated with Hölder-smooth objectives. The example highlights a basic mismatch between classical AdaGrad accumulation and composite optimality. A main insight is that the gradient of the smooth term may not vanish at the optimum, causing AdaGrad to keep reducing its stepsize excessively and converge more slowly.

What carries the argument

AdaGrad's accumulation of squared gradient norms, which drives stepsize reduction independently of whether the smooth gradient vanishes at the composite minimizer.

If this is right

  • AdaGrad converges more slowly than the classical Hölder rate on the constructed composite problem.
  • Accumulation based on gradient mappings avoids the excessive stepsize reduction.
  • Accumulation based on successive gradient differences also avoids the pathology.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same accumulation mismatch could appear in higher-dimensional composite problems with non-vanishing smooth gradients.
  • Other diagonal adaptive methods that rely on similar squared-gradient accumulations may inherit the same limitation.
  • Composite-aware stepsize rules might need to incorporate information from the nonsmooth term explicitly.

Load-bearing premise

The counterexample is a convex composite problem in which the gradient of the smooth term remains nonzero at the minimizer.

What would settle it

Run the AdaGrad update on the paper's explicit one-dimensional example and measure whether the distance to the minimizer decays at rate O(n^{-(1+ν)/2}) or slower.

Figures

Figures reproduced from arXiv: 2606.29893 by Massimiliano Pontil, Matia Bojovic, Saverio Salzo.

Figure 1
Figure 1. Figure 1: Left: f(x) = gp,c(x) − x. Right: f ′ (x) = g ′ p,c(x) − 1, for p = 4, 6, 8 and c = 0.8. Related works. The issue of the accumulation mechanism of the gradients in AdaGrad has already been considered in the literature. Indeed, there exist works proposing different accumulation schemes. Antonakopoulos et al. [2025], Wang and Yurtsever [2026] analyze an AdaGrad-type algorithm, which accumulates gradient mappi… view at source ↗
read the original abstract

We exhibit a simple deterministic one-dimensional convex composite optimization problem for which AdaGrad scheme does not achieve the classical convergence rate $\mathcal{O}(n^{-(1+\nu)/2})$ associated with H\"older-smooth objectives. The example highlights a basic mismatch between classical AdaGrad accumulation and composite optimality. A main insight is that the gradient of the smooth term may not vanish at the optimum, causing AdaGrad to keep reducing its stepsize excessively and converge more slowly. We also discuss why alternative accumulation mechanisms based on gradient mappings or on successive gradient differences, avoid this pathology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper constructs a simple deterministic one-dimensional convex composite optimization problem to demonstrate that the standard AdaGrad algorithm fails to achieve the expected convergence rate O(n^{-(1+ν)/2}) for Hölder-smooth objectives with exponent ν. The central reason is that the gradient of the smooth term does not vanish at the minimizer, leading to continued excessive step-size shrinkage; the manuscript also discusses why gradient-mapping or successive-difference accumulations avoid this issue.

Significance. If the explicit construction holds, the result is significant for clarifying a basic mismatch between AdaGrad's step-size accumulation and composite optimality conditions under Hölder smoothness. The paper receives credit for supplying a deterministic, parameter-free counterexample together with a clear mechanistic explanation (non-vanishing smooth gradient) and for contrasting it with alternative accumulation rules that restore the expected rate.

minor comments (1)
  1. The abstract could briefly name the concrete functions used in the 1-D example to let readers immediately check the Hölder condition and the non-vanishing-gradient property.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript and for recommending acceptance. The referee's description of the contribution is accurate.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs an explicit deterministic 1D convex composite optimization problem as a counterexample showing that standard AdaGrad fails to attain the O(n^{-(1+ν)/2}) rate for Hölder-smooth objectives when the smooth-term gradient does not vanish at the minimizer. This is a direct existence proof via problem definition and rate calculation, with no fitted parameters renamed as predictions, no self-citation chains, no ansatz smuggling, and no self-definitional reductions. The claim is self-contained against the external benchmark rate and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on standard convex optimization and Hölder continuity assumptions plus the explicit construction of a 1D instance where the smooth gradient remains nonzero at optimality; no free parameters or invented entities are introduced.

axioms (2)
  • standard math The objective is convex and the smooth part has Hölder-continuous gradient with exponent ν
    Invoked to invoke the classical rate O(n^{-(1+ν)/2}) as benchmark
  • domain assumption The gradient of the smooth term need not vanish at the composite minimizer
    Central to the pathology described in the abstract

pith-pipeline@v0.9.1-grok · 5626 in / 1278 out tokens · 41604 ms · 2026-06-30T05:35:02.877673+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

9 extracted references · 1 canonical work pages

  1. [1]

    Bojovic, Matia and Salzo, Saverio and Pontil, Massimiliano , journal=. Ada

  2. [2]

    Journal of Machine Learning Research , year =

    John Duchi and Elad Hazan and Yoram Singer , title =. Journal of Machine Learning Research , year =

  3. [3]

    Universal

    Wang, Zimeng and Yurtsever, Alp , journal=. Universal

  4. [4]

    Journal of Machine Learning Research , volume=

    Adagrad stepsizes: Sharp convergence over nonconvex landscapes , author=. Journal of Machine Learning Research , volume=

  5. [5]

    Conference on Learning Theory , pages=

    The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance , author=. Conference on Learning Theory , pages=. 2022 , organization=

  6. [6]

    arXiv preprint arXiv:2308.05621 , year=

    Normalized gradients for all , author=. arXiv preprint arXiv:2308.05621 , year=

  7. [7]

    Advances in neural information processing systems , volume=

    Online adaptive methods, universality and acceleration , author=. Advances in neural information processing systems , volume=

  8. [8]

    ACM/IMS Journal of Data Science , volume=

    Adaptive bilevel optimization , author=. ACM/IMS Journal of Data Science , volume=. 2025 , publisher=

  9. [9]

    Mathematical Programming , volume=

    Universal gradient methods for convex optimization problems , author=. Mathematical Programming , volume=. 2015 , publisher=