Universal Adaptive Proximal Gradient Methods via Gradient Mapping Accumulation
Pith reviewed 2026-05-08 08:23 UTC · model grok-4.3
The pith
A single adaptive proximal gradient method, which accumulates squared gradient-mapping norms in its step size, converges for nonconvex smooth, convex nonsmooth, and convex smooth problems without any parameter tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining the step size as the reciprocal of the square root of the accumulated squared norms of gradient mappings, the algorithm guarantees that the same update rule produces convergent sequences for any of the three function classes without knowledge of Lipschitz constants, strong-convexity parameters, or variance bounds. The analysis shows that the resulting rates match those of the proximal gradient method up to logarithmic factors and that the same accumulation technique supplies new stochastic-noise bounds that simplify the proofs for all three classes simultaneously.
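Spelled out in symbols (our notation, reconstructed from the abstract's description rather than taken from the paper; the first step needs some positive initialization of the denominator), the rule reads:

```latex
% Gradient mapping of f + g at x_t with step size \eta_t:
G_t \;=\; \frac{1}{\eta_t}\Big(x_t - \operatorname{prox}_{\eta_t g}\big(x_t - \eta_t \nabla f(x_t)\big)\Big),
\qquad x_{t+1} \;=\; x_t - \eta_t G_t,
% step size: reciprocal square root of the accumulated squared mapping norms:
\eta_t \;=\; \frac{\eta_0}{\sqrt{\sum_{k < t} \|G_k\|^2}}.
```

Note that $\eta_t$ depends only on past mappings $G_k$, $k < t$, so the recursion is well defined; no Lipschitz or variance constant enters the rule.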
What carries the argument
The adaptive step-size rule that accumulates the squared Euclidean norms of historical gradient mappings in the denominator and uses their sum to scale the proximal step.
If this is right
- The same algorithm recovers the standard proximal-gradient rate for nonconvex smooth problems in both deterministic and stochastic settings.
- It recovers the standard rate for convex nonsmooth problems without any change to the update rule.
- It recovers the standard rate for convex smooth problems and the accelerated variant improves the stochastic smooth rate to near-optimal order.
- The noise-control arguments developed for the stochastic case apply uniformly to all three function classes.
Where Pith is reading between the lines
- The accumulation of gradient-mapping norms may serve as a model-free proxy for unknown problem constants in other first-order methods.
- The uniform noise-handling technique could simplify convergence proofs for other adaptive proximal schemes that currently treat each function class separately.
- Implementation of this rule requires only storage of a running sum of squared norms, which is inexpensive and easily added to existing proximal-gradient code.
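To make the "running sum of squared norms" point concrete, here is a minimal sketch of such an accumulation rule bolted onto a standard proximal-gradient loop for the lasso, f(x) = ½‖Ax − b‖² plus g(x) = μ‖x‖₁. All names (`adaptive_prox_grad`, `eta0`, the unit offset in the denominator) are illustrative assumptions, not the paper's actual algorithm or initialization.

```python
import numpy as np

def prox_g(v, step, mu):
    """Soft-thresholding: proximal operator of step * mu * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - step * mu, 0.0)

def adaptive_prox_grad(A, b, mu, x0, iters=500, eta0=1.0):
    """Proximal gradient with an AdaGrad-style step built from
    accumulated squared gradient-mapping norms (illustrative sketch)."""
    x = x0.copy()
    accum = 0.0  # the only extra state: running sum of squared mapping norms
    for _ in range(iters):
        # Unit offset avoids division by zero on the first step;
        # the paper's exact initialization may differ.
        step = eta0 / np.sqrt(1.0 + accum)
        grad = A.T @ (A @ x - b)                    # gradient of the smooth part
        x_new = prox_g(x - step * grad, step, mu)   # proximal step
        G = (x - x_new) / step                      # gradient mapping at x
        accum += np.dot(G, G)                       # accumulate its squared norm
        x = x_new
    return x
```

On a small well-conditioned instance this converges to the lasso minimizer without any knowledge of the smoothness constant, which is exactly the cheap add-on the bullet above describes.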
Load-bearing premise
The sequence of iterates stays bounded and the stochastic noise has bounded variance at every iteration.
What would settle it
An explicit example in which iterates become unbounded or variance grows without bound and the method fails to converge would falsify the universal-convergence claim.
read the original abstract
We propose an adaptive proximal gradient method for minimizing the sum of two functions, where one is a simple convex function, and the other belongs to one of the three classes: nonconvex smooth, convex nonsmooth, or convex smooth. The key feature of the method is an adaptive step size that accumulates historical gradient mapping norms in the denominator. Without any modification or knowledge of problem parameters, the method converges across all three problem classes under mild bounded-iterates and bounded-variance assumptions, with rates matching those of the proximal gradient method up to logarithmic factors, in both deterministic and stochastic settings. For the convex setting, we further propose an accelerated variant. It retains a similar near-optimal convergence rate for the nonsmooth case and achieves an improved rate of order $\widetilde{O}\big(1/t^2 + \sigma/\sqrt{t}\big)$ for the smooth case, which is optimal up to logarithmic factors. Notably, we develop new techniques for controlling the effect of stochastic noise, which are applicable across all three problem classes in the stochastic setting and enable simplified analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a single adaptive proximal gradient algorithm for min f(x) + g(x) (g convex and simple) in which step sizes are formed by accumulating squared norms of the gradient mapping in the denominator. The central claim is that this construction, without any problem-class-specific modification or knowledge of smoothness/convexity parameters, attains the standard proximal-gradient convergence rates (up to logarithmic factors) simultaneously for non-convex smooth, convex nonsmooth, and convex smooth objectives, in both deterministic and stochastic regimes, under only bounded-iterates and bounded-variance assumptions. An accelerated variant is also introduced for the convex setting that preserves near-optimal rates for the nonsmooth case and improves the smooth case to Õ(1/t² + σ/√t). New noise-control arguments are developed that apply uniformly across all three classes.
Significance. If the stated rates and uniform analysis hold, the work would constitute a meaningful advance in adaptive first-order methods by eliminating the need for case-by-case tuning or parameter knowledge while retaining (near-)optimal guarantees across disparate problem classes. The uniform stochastic-noise control technique is a clear technical strength that could simplify future analyses. The accelerated convex-smooth rate being optimal up to logs is also noteworthy. The explicit identification of the bounded-iterates hypothesis as necessary for the non-convex case is appropriately cautious.
minor comments (2)
- [Abstract] Abstract: the claimed rates are described only qualitatively (“matching those of the proximal gradient method up to logarithmic factors”). Stating the precise rates (including the explicit log factors) already in the abstract would make the contribution immediately clearer to readers.
- [Introduction] The bounded-iterates assumption is flagged as necessary for the non-convex case, but its precise quantitative role in controlling the accumulated denominator is not previewed; a one-sentence remark in the introduction would help.
Simulated Author's Rebuttal
We thank the referee for their careful reading and positive assessment of our manuscript. We appreciate the recognition of the uniform analysis across the three problem classes, the new noise-control techniques, and the near-optimal rates achieved by the accelerated variant. The recommendation for minor revision is noted, and we stand ready to incorporate any editorial or minor clarifications in the revised version.
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper defines an adaptive proximal gradient method whose step sizes are formed by accumulating squared gradient-mapping norms in the denominator, then proves convergence rates (matching proximal gradient up to logarithmic factors) for non-convex smooth, convex nonsmooth, and convex smooth objectives under explicitly stated bounded-iterates and bounded-variance assumptions. The analysis introduces new noise-control arguments that apply uniformly across classes and regimes; no equation reduces a claimed rate to a fitted parameter, self-referential definition, or prior self-citation that itself depends on the target result. The bounded-iterates hypothesis is flagged as necessary for the non-convex case and is a standard external condition rather than an ansatz smuggled via citation. The construction and rates are therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: the iterate sequence remains bounded
- domain assumption: the stochastic noise has bounded variance
Reference graph
Works this paper leans on
- [1] A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. In International Conference on Machine Learning, pages 1147--1171. PMLR, 2023.
- [2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183--202, 2009.
- [3] J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39(227):357--365, 1944.
- [4] E. Candes and B. Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111--119, 2012.
- [5] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
- [6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273--297, 1995.
- [7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121--2159, 2011.
- [8] A. Ene, H. Nguyen, and A. Vladu. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7314--7321, 2021.
- [9] M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. In Conference on Learning Theory, pages 313--355. PMLR, 2022.
- [10] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341--2368, 2013.
- [11] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59--99, 2016.
- [12] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267--305, 2016.
- [13] P. Joulani, A. Raj, A. Gyorgy, and C. Szepesvári. A simpler approach to accelerated optimization: iterative averaging meets optimism. In International Conference on Machine Learning, pages 4984--4993. PMLR, 2020.
- [14] A. Kavis, K. Levy, F. Bach, and V. Cevher. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in Neural Information Processing Systems, 32, 2019.
- [15] A. Kavis, K. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. In International Conference on Learning Representations, 2022.
- [16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- [17] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365--397, 2012.
- [18] P. Latafat, A. Themelis, and P. Patrinos. On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms. In 6th Annual Learning for Dynamics & Control Conference, pages 197--208. PMLR, 2024.
- [19] P. Latafat, A. Themelis, L. Stella, and P. Patrinos. Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. Mathematical Programming, 213(1):433--471, 2025.
- [20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278--2324, 1998.
- [21] E. Levitin and B. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1--50, 1966.
- [22] K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. Advances in Neural Information Processing Systems, 30, 2017.
- [23] K. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. Advances in Neural Information Processing Systems, 31, 2018.
- [24] P. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964--979, 1979.
- [25] Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. In International Conference on Machine Learning, pages 6702--6712. PMLR, 2020.
- [26] Y. Malitsky and K. Mishchenko. Adaptive proximal gradient method for convex optimization. Advances in Neural Information Processing Systems, 37:100670--100697, 2024.
- [27] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR, 269:543--547, 1983.
- [28] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
- [29] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125--161, 2013.
- [30] Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381--404, 2015.
- [31] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127--239, 2014.
- [32] G. Passty. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383--390, 1979.
- [33] S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
- [34] A. Rodomanov, X. Jiang, and S. Stich. Universality of AdaGrad stepsizes for stochastic optimization: Inexact oracle, acceleration and variance reduction. Advances in Neural Information Processing Systems, 37:26770--26813, 2024.
- [35] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267--288, 1996.
- [36] T. Tieleman and G. Hinton. Lecture 6.5, RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- [37] P. Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431--446, 2000.
- [38] B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of AdaGrad for non-convex objectives: Simple proofs and relaxed assumptions. In The Thirty Sixth Annual Conference on Learning Theory, pages 161--190. PMLR, 2023.
- [39] J. Wang, X. Wang, and L. Zhang. Stochastic regularized Newton methods for nonlinear equations. Journal of Scientific Computing, 94(3):51, 2023.
- [40] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927--956, 2017.
- [41] R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Journal of Machine Learning Research, 21:1--30, 2020.
- [42] J. Yun, A. Lozano, and E. Yang. Adaptive proximal gradient methods for structured neural networks. Advances in Neural Information Processing Systems, 34:24365--24378, 2021.
- [43]
discussion (0)