pith. machine review for the scientific record.

arxiv: 2605.05944 · v1 · submitted 2026-05-07 · 🧮 math.OC

Recognition: unknown

Universal Adaptive Proximal Gradient Methods via Gradient Mapping Accumulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:23 UTC · model grok-4.3

classification 🧮 math.OC
keywords adaptive proximal gradient · gradient mapping accumulation · universal convergence · stochastic optimization · nonconvex optimization · accelerated methods · composite optimization

The pith

A single adaptive proximal gradient method converges for nonconvex smooth, convex nonsmooth, and convex smooth problems by accumulating gradient mapping norms without parameter tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adaptive proximal gradient algorithm whose step size is chosen by placing the sum of past gradient mapping norms in the denominator. This choice lets one algorithm handle the composite minimization of a simple convex function plus a term that is either nonconvex smooth, convex nonsmooth, or convex smooth. Convergence holds under only bounded-iterates and bounded-variance assumptions and recovers the rates of the classical proximal gradient method up to logarithmic factors in both deterministic and stochastic regimes. For the convex setting, an accelerated variant is supplied that retains a near-optimal rate in the nonsmooth case and improves the smooth stochastic rate to order Õ(1/t² + σ/√t). New noise-control arguments are introduced that apply uniformly across the three problem classes.

Core claim

By defining the step size as the reciprocal of the square root of the accumulated squared norms of gradient mappings, the algorithm guarantees that the same update rule produces convergent sequences for any of the three function classes without knowledge of Lipschitz constants, strong-convexity parameters, or variance bounds. The analysis shows that the resulting rates match those of the proximal gradient method up to logarithmic factors and that the same accumulation technique supplies new stochastic-noise bounds that simplify the proofs for all three classes simultaneously.
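Rendered as a formula, this is a hedged reconstruction from the description above, not the paper's verbatim Algorithm 1; the seed constant γ, the scale η (the symbols reported in Figure 1), and the exact indexing of the accumulated sum are assumptions, with G denoting the usual gradient mapping of the composite objective F = f + g:

$$
G_{\eta}(x) = \frac{1}{\eta}\Big(x - \mathrm{prox}_{\eta g}\big(x - \eta \nabla f(x)\big)\Big),
\qquad
\eta_{t} = \frac{\eta}{\sqrt{\gamma + \sum_{k=1}^{t-1} \big\|G_{\eta_{k}}(x_{k})\big\|^{2}}}.
$$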

What carries the argument

The adaptive step-size rule that accumulates the squared Euclidean norms of historical gradient mappings in the denominator and uses their sum to scale the proximal step.
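A minimal Python sketch of such an update loop, written against the description above rather than the paper's Algorithm 1, which may differ in details such as where the accumulation enters and how the first step is seeded; grad_f, prox_g, eta, and gamma are assumed interfaces and constants, not the authors' code:

    import numpy as np

    def adaptive_prox_grad(grad_f, prox_g, x0, n_iters=1000, eta=1.0, gamma=1.0):
        """Hedged sketch: step size = eta / sqrt(accumulated squared gradient-mapping norms).

        grad_f(x): (possibly stochastic) gradient of the smooth term f at x.
        prox_g(v, step): proximal operator of the simple convex term g,
            i.e. argmin_z g(z) + ||z - v||^2 / (2 * step).
        eta, gamma: scale constants (cf. gamma = 1, eta in {1, 10, 100} in Figure 1);
            their exact role here is an assumption.
        """
        x = np.asarray(x0, dtype=float)
        acc = gamma  # running sum of squared gradient-mapping norms, seeded by gamma
        for _ in range(n_iters):
            step = eta / np.sqrt(acc)            # step size from the accumulated denominator
            g = grad_f(x)
            x_next = prox_g(x - step * g, step)  # standard proximal gradient step
            G = (x - x_next) / step              # gradient mapping at the current iterate
            acc += float(np.dot(G, G))           # accumulate ||G(x_t)||^2 for later steps
            x = x_next
        return x

The only extra state beyond the iterate is the scalar acc; everything else is an ordinary proximal gradient step.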

If this is right

  • The same algorithm recovers the standard proximal-gradient rate for nonconvex smooth problems in both deterministic and stochastic settings.
  • It recovers the standard rate for convex nonsmooth problems without any change to the update rule.
  • It recovers the standard rate for convex smooth problems and the accelerated variant improves the stochastic smooth rate to near-optimal order.
  • The noise-control arguments developed for the stochastic case apply uniformly to all three function classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The accumulation of gradient-mapping norms may serve as a model-free proxy for unknown problem constants in other first-order methods.
  • The uniform noise-handling technique could simplify convergence proofs for other adaptive proximal schemes that currently treat each function class separately.
  • Implementation of this rule requires only storage of a running sum of squared norms, which is inexpensive and easily added to existing proximal-gradient code (see the sketch below).
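As a concrete version of that last point, a minimal retrofit sketch under the same assumptions as the loop above; the class name and interface are invented for illustration, not taken from the paper:

    class MappingAccumulator:
        """Hypothetical drop-in helper: keeps the running sum of squared
        gradient-mapping norms and returns the matching adaptive step size."""

        def __init__(self, eta=1.0, gamma=1.0):
            self.eta = eta
            self.acc = gamma  # seed for the accumulated denominator

        def step_size(self):
            # eta / sqrt(sum of squared gradient-mapping norms seen so far)
            return self.eta / self.acc ** 0.5

        def update(self, G):
            # add ||G||^2 for the gradient mapping computed this iteration
            self.acc += sum(float(v) ** 2 for v in G)

An existing proximal-gradient loop would call step_size() before its update and update(G) after forming the gradient mapping; no other state is needed.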

Load-bearing premise

The sequence of iterates stays bounded and the stochastic noise has bounded variance at every iteration.

What would settle it

An explicit example in which iterates become unbounded or variance grows without bound and the method fails to converge would falsify the universal-convergence claim.

Figures

Figures reproduced from arXiv: 2605.05944 by Alp Yurtsever, Zimeng Wang.

Figure 1: Evolution of ∥G(x)∥ for the nonconvex SVM problem (6.1) on the MNIST dataset in the deterministic setting. Algorithms 1 and 2 are evaluated with fixed γ = 1 and three choices of η ∈ {1, 10, 100}, while the proximal gradient method (PG) is evaluated with constant step sizes η_k ≡ α ∈ {0.1/L, 1/L, 10/L}, where L is the smoothness constant of f. Panels: (a) ∥G(x_t)∥, (b) the running average (1/t) Σ_{k=1}^{t} ∥G(x_k)∥, (c) test accuracy.
Figure 2: Effect of batch size on Algorithm 1 for the nonconvex SVM problem (6.1) on the a9a dataset; b = 1% means the gradient batch size is 1% of the training set size.
Figure 3: Comparison between Algorithms 1 and 2 for the regularized logistic regression problem (6.2) on the w6a dataset under different values of η ∈ {0.1, 1, 10, 100}. The averaged iterates are x̄_t := (1/t) Σ_{k=1}^{t} x_k for Algorithm 1 and ȳ_t := (Σ_{k=1}^{t} α_k y_k) / (Σ_{k=1}^{t} α_k) for Algorithm 2.
Figure 4: Comparison among Algorithms 1 and 2, UniXGrad, and AcceleGrad for the logistic regression problem (6.2) with λ = 0 on the connect4 dataset.
Original abstract

We propose an adaptive proximal gradient method for minimizing the sum of two functions, where one is a simple convex function, and the other belongs to one of the three classes: nonconvex smooth, convex nonsmooth, or convex smooth. The key feature of the method is an adaptive step size that accumulates historical gradient mapping norms in the denominator. Without any modification or knowledge of problem parameters, the method converges across all three problem classes under mild bounded-iterates and bounded-variance assumptions, with rates matching those of the proximal gradient method up to logarithmic factors, in both deterministic and stochastic settings. For the convex setting, we further propose an accelerated variant. It retains a similar near-optimal convergence rate for the nonsmooth case and achieves an improved rate of order $\widetilde{O}\big(1/t^2 + \sigma/\sqrt{t}\big)$ for the smooth case, which is optimal up to logarithmic factors. Notably, we develop new techniques for controlling the effect of stochastic noise, which are applicable across all three problem classes in the stochastic setting and enable simplified analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a single adaptive proximal gradient algorithm for min f(x) + g(x) (g convex and simple) in which step sizes are formed by accumulating squared norms of the gradient mapping in the denominator. The central claim is that this construction, without any problem-class-specific modification or knowledge of smoothness/convexity parameters, attains the standard proximal-gradient convergence rates (up to logarithmic factors) simultaneously for non-convex smooth, convex nonsmooth, and convex smooth objectives, in both deterministic and stochastic regimes, under only bounded-iterates and bounded-variance assumptions. An accelerated variant is also introduced for the convex setting that preserves near-optimal rates for the nonsmooth case and improves the smooth case to Õ(1/t² + σ/√t). New noise-control arguments are developed that apply uniformly across all three classes.

Significance. If the stated rates and uniform analysis hold, the work would constitute a meaningful advance in adaptive first-order methods by eliminating the need for case-by-case tuning or parameter knowledge while retaining (near-)optimal guarantees across disparate problem classes. The uniform stochastic-noise control technique is a clear technical strength that could simplify future analyses. The accelerated convex-smooth rate being optimal up to logs is also noteworthy. The explicit identification of the bounded-iterates hypothesis as necessary for the non-convex case is appropriately cautious.

minor comments (2)
  1. [Abstract] The claimed rates are described only qualitatively (“matching those of the proximal gradient method up to logarithmic factors”). Stating the precise rates (including the explicit log factors) in the abstract would make the contribution immediately clearer to readers.
  2. [Introduction] The bounded-iterates assumption is flagged as necessary for the non-convex case, but its precise quantitative role in controlling the accumulated denominator is not previewed; a one-sentence remark in the introduction would help.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive assessment of our manuscript. We appreciate the recognition of the uniform analysis across the three problem classes, the new noise-control techniques, and the near-optimal rates achieved by the accelerated variant. The recommendation for minor revision is noted, and we stand ready to incorporate any editorial or minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper defines an adaptive proximal gradient method whose step sizes are formed by accumulating squared gradient-mapping norms in the denominator, then proves convergence rates (matching proximal gradient up to logarithmic factors) for non-convex smooth, convex nonsmooth, and convex smooth objectives under explicitly stated bounded-iterates and bounded-variance assumptions. The analysis introduces new noise-control arguments that apply uniformly across classes and regimes; no equation reduces a claimed rate to a fitted parameter, self-referential definition, or prior self-citation that itself depends on the target result. The bounded-iterates hypothesis is flagged as necessary for the non-convex case and is a standard external condition rather than an ansatz smuggled via citation. The construction and rates are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions stated in the abstract; no free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption bounded-iterates assumption
    Invoked to guarantee convergence across all three problem classes in both deterministic and stochastic settings.
  • domain assumption bounded-variance assumption
    Required for the stochastic setting to control noise effects.

pith-pipeline@v0.9.0 · 5479 in / 1265 out tokens · 60679 ms · 2026-05-08T08:23:22.748827+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 1 canonical work page

  1. A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. In International Conference on Machine Learning, pages 1147–1171. PMLR, 2023.
  2. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  3. J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39(227):357–365, 1944.
  4. E. Candes and B. Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
  5. C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
  6. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  7. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  8. A. Ene, H. Nguyen, and A. Vladu. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7314–7321, 2021.
  9. M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. In Conference on Learning Theory, pages 313–355. PMLR, 2022.
  10. S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  11. S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
  12. S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  13. P. Joulani, A. Raj, A. Gyorgy, and C. Szepesvári. A simpler approach to accelerated optimization: iterative averaging meets optimism. In International Conference on Machine Learning, pages 4984–4993. PMLR, 2020.
  14. A. Kavis, K. Levy, F. Bach, and V. Cevher. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in Neural Information Processing Systems, 32, 2019.
  15. A. Kavis, K. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. In International Conference on Learning Representations, 2022.
  16. D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  17. G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
  18. P. Latafat, A. Themelis, and P. Patrinos. On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms. In 6th Annual Learning for Dynamics & Control Conference, pages 197–208. PMLR, 2024.
  19. P. Latafat, A. Themelis, L. Stella, and P. Patrinos. Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. Mathematical Programming, 213(1):433–471, 2025.
  20. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002.
  21. E. Levitin and B. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.
  22. K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. Advances in Neural Information Processing Systems, 30, 2017.
  23. K. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. Advances in Neural Information Processing Systems, 31, 2018.
  24. P. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
  25. Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. In International Conference on Machine Learning, pages 6702–6712. PMLR, 2020.
  26. Y. Malitsky and K. Mishchenko. Adaptive proximal gradient method for convex optimization. Advances in Neural Information Processing Systems, 37:100670–100697, 2024.
  27. Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
  28. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
  29. Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
  30. Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
  31. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
  32. G. Passty. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383–390, 1979.
  33. S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
  34. A. Rodomanov, X. Jiang, and S. Stich. Universality of AdaGrad stepsizes for stochastic optimization: Inexact oracle, acceleration and variance reduction. Advances in Neural Information Processing Systems, 37:26770–26813, 2024.
  35. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
  36. T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  37. P. Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
  38. B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of AdaGrad for non-convex objectives: Simple proofs and relaxed assumptions. In The Thirty Sixth Annual Conference on Learning Theory, pages 161–190. PMLR, 2023.
  39. J. Wang, X. Wang, and L. Zhang. Stochastic regularized Newton methods for nonlinear equations. Journal of Scientific Computing, 94(3):51, 2023.
  40. X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927–956, 2017.
  41. R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Journal of Machine Learning Research, 21:1–30, 2020.
  42. J. Yun, A. Lozano, and E. Yang. Adaptive proximal gradient methods for structured neural networks. Advances in Neural Information Processing Systems, 34:24365–24378, 2021.
  43. M. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.