pith. machine review for the scientific record.

arxiv: 2605.05944 · v1 · submitted 2026-05-07 · 🧮 math.OC

Recognition: unknown

Universal Adaptive Proximal Gradient Methods via Gradient Mapping Accumulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:23 UTC · model grok-4.3

classification 🧮 math.OC
keywords adaptive proximal gradient · gradient mapping accumulation · universal convergence · stochastic optimization · nonconvex optimization · accelerated methods · composite optimization

The pith

A single adaptive proximal gradient method converges for nonconvex smooth, convex nonsmooth, and convex smooth problems by accumulating gradient mapping norms without parameter tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adaptive proximal gradient algorithm whose step size is chosen by placing the sum of past gradient mapping norms in the denominator. This choice lets one algorithm handle the composite minimization of a simple convex function plus a term that is either nonconvex smooth, convex nonsmooth, or convex smooth. Convergence holds under only bounded-iterates and bounded-variance assumptions and recovers the rates of the classical proximal gradient method up to logarithmic factors in both deterministic and stochastic regimes. For the convex setting, an accelerated variant is supplied that retains a near-optimal rate in the nonsmooth case and improves the smooth stochastic rate to order Õ(1/t² + σ/√t). New noise-control arguments are introduced that apply uniformly across the three problem classes.

Core claim

By defining the step size as the reciprocal of the square root of the accumulated squared norms of gradient mappings, the algorithm guarantees that the same update rule produces convergent sequences for any of the three function classes without knowledge of Lipschitz constants, strong-convexity parameters, or variance bounds. The analysis shows that the resulting rates match those of the proximal gradient method up to logarithmic factors and that the same accumulation technique supplies new stochastic-noise bounds that simplify the proofs for all three classes simultaneously.
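Rendered as a formula, this is a hedged reconstruction from the description above, not the paper's verbatim Algorithm 1; the seed constant γ, the scale η (the symbols reported in Figure 1), and the exact indexing of the accumulated sum are assumptions, with G denoting the usual gradient mapping of the composite objective F = f + g:

$$
G_{\eta}(x) = \frac{1}{\eta}\Big(x - \mathrm{prox}_{\eta g}\big(x - \eta \nabla f(x)\big)\Big),
\qquad
\eta_{t} = \frac{\eta}{\sqrt{\gamma + \sum_{k=1}^{t-1} \big\|G_{\eta_{k}}(x_{k})\big\|^{2}}}.
$$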

What carries the argument

The adaptive step-size rule that accumulates the squared Euclidean norms of historical gradient mappings in the denominator and uses their sum to scale the proximal step.
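A minimal Python sketch of such an update loop, written against the description above rather than the paper's Algorithm 1, which may differ in details such as where the accumulation enters and how the first step is seeded; grad_f, prox_g, eta, and gamma are assumed interfaces and constants, not the authors' code:

    import numpy as np

    def adaptive_prox_grad(grad_f, prox_g, x0, n_iters=1000, eta=1.0, gamma=1.0):
        """Hedged sketch: step size = eta / sqrt(accumulated squared gradient-mapping norms).

        grad_f(x): (possibly stochastic) gradient of the smooth term f at x.
        prox_g(v, step): proximal operator of the simple convex term g,
            i.e. argmin_z g(z) + ||z - v||^2 / (2 * step).
        eta, gamma: scale constants (cf. gamma = 1, eta in {1, 10, 100} in Figure 1);
            their exact role here is an assumption.
        """
        x = np.asarray(x0, dtype=float)
        acc = gamma  # running sum of squared gradient-mapping norms, seeded by gamma
        for _ in range(n_iters):
            step = eta / np.sqrt(acc)            # step size from the accumulated denominator
            g = grad_f(x)
            x_next = prox_g(x - step * g, step)  # standard proximal gradient step
            G = (x - x_next) / step              # gradient mapping at the current iterate
            acc += float(np.dot(G, G))           # accumulate ||G(x_t)||^2 for later steps
            x = x_next
        return x

The only extra state beyond the iterate is the scalar acc; everything else is an ordinary proximal gradient step.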

If this is right

  • The same algorithm recovers the standard proximal-gradient rate for nonconvex smooth problems in both deterministic and stochastic settings.
  • It recovers the standard rate for convex nonsmooth problems without any change to the update rule.
  • It recovers the standard rate for convex smooth problems and the accelerated variant improves the stochastic smooth rate to near-optimal order.
  • The noise-control arguments developed for the stochastic case apply uniformly to all three function classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The accumulation of gradient-mapping norms may serve as a model-free proxy for unknown problem constants in other first-order methods.
  • The uniform noise-handling technique could simplify convergence proofs for other adaptive proximal schemes that currently treat each function class separately.
  • Implementation of this rule requires only storage of a running sum of squared norms, which is inexpensive and easily added to existing proximal-gradient code (see the sketch below).
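As a concrete version of that last point, a minimal retrofit sketch under the same assumptions as the loop above; the class name and interface are invented for illustration, not taken from the paper:

    class MappingAccumulator:
        """Hypothetical drop-in helper: keeps the running sum of squared
        gradient-mapping norms and returns the matching adaptive step size."""

        def __init__(self, eta=1.0, gamma=1.0):
            self.eta = eta
            self.acc = gamma  # seed for the accumulated denominator

        def step_size(self):
            # eta / sqrt(sum of squared gradient-mapping norms seen so far)
            return self.eta / self.acc ** 0.5

        def update(self, G):
            # add ||G||^2 for the gradient mapping computed this iteration
            self.acc += sum(float(v) ** 2 for v in G)

An existing proximal-gradient loop would call step_size() before its update and update(G) after forming the gradient mapping; no other state is needed.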

Load-bearing premise

The sequence of iterates stays bounded and the stochastic noise has bounded variance at every iteration.

What would settle it

An explicit example in which iterates become unbounded or variance grows without bound and the method fails to converge would falsify the universal-convergence claim.

Figures

Figures reproduced from arXiv: 2605.05944 by Alp Yurtsever, Zimeng Wang.

Figure 1: Evolution of ∥G(x)∥ for the nonconvex SVM problem (6.1) on the MNIST dataset in the deterministic setting. Algorithms 1 and 2 are evaluated with fixed γ = 1 and three choices of η ∈ {1, 10, 100}, while the proximal gradient method (PG) is evaluated with constant step sizes η_k ≡ α ∈ {0.1/L, 1/L, 10/L}, where L is the smoothness constant of f. Panels: (a) ∥G(x_t)∥, (b) the running average (1/t) Σ_{k=1}^{t} ∥G(x_k)∥, (c) test accuracy.
Figure 2: Effect of batch size on Algorithm 1 for the nonconvex SVM problem (6.1) on the a9a dataset; b = 1% means the gradient batch size is 1% of the training set size.
Figure 3: Comparison between Algorithms 1 and 2 for the regularized logistic regression problem (6.2) on the w6a dataset under different values of η ∈ {0.1, 1, 10, 100}. The averaged iterates are x̄_t := (1/t) Σ_{k=1}^{t} x_k for Algorithm 1 and ȳ_t := (Σ_{k=1}^{t} α_k y_k) / (Σ_{k=1}^{t} α_k) for Algorithm 2.
Figure 4: Comparison among Algorithms 1 and 2, UniXGrad, and AcceleGrad for the logistic regression problem (6.2) with λ = 0 on the connect4 dataset.
Original abstract

We propose an adaptive proximal gradient method for minimizing the sum of two functions, where one is a simple convex function, and the other belongs to one of the three classes: nonconvex smooth, convex nonsmooth, or convex smooth. The key feature of the method is an adaptive step size that accumulates historical gradient mapping norms in the denominator. Without any modification or knowledge of problem parameters, the method converges across all three problem classes under mild bounded-iterates and bounded-variance assumptions, with rates matching those of the proximal gradient method up to logarithmic factors, in both deterministic and stochastic settings. For the convex setting, we further propose an accelerated variant. It retains a similar near-optimal convergence rate for the nonsmooth case and achieves an improved rate of order $\widetilde{O}\big(1/t^2 + \sigma/\sqrt{t}\big)$ for the smooth case, which is optimal up to logarithmic factors. Notably, we develop new techniques for controlling the effect of stochastic noise, which are applicable across all three problem classes in the stochastic setting and enable simplified analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a single adaptive proximal gradient algorithm for min f(x) + g(x) (g convex and simple) in which step sizes are formed by accumulating squared norms of the gradient mapping in the denominator. The central claim is that this construction, without any problem-class-specific modification or knowledge of smoothness/convexity parameters, attains the standard proximal-gradient convergence rates (up to logarithmic factors) simultaneously for non-convex smooth, convex nonsmooth, and convex smooth objectives, in both deterministic and stochastic regimes, under only bounded-iterates and bounded-variance assumptions. An accelerated variant is also introduced for the convex setting that preserves near-optimal rates for the nonsmooth case and improves the smooth case to Õ(1/t² + σ/√t). New noise-control arguments are developed that apply uniformly across all three classes.

Significance. If the stated rates and uniform analysis hold, the work would constitute a meaningful advance in adaptive first-order methods by eliminating the need for case-by-case tuning or parameter knowledge while retaining (near-)optimal guarantees across disparate problem classes. The uniform stochastic-noise control technique is a clear technical strength that could simplify future analyses. The accelerated convex-smooth rate being optimal up to logs is also noteworthy. The explicit identification of the bounded-iterates hypothesis as necessary for the non-convex case is appropriately cautious.

minor comments (2)
  1. [Abstract] The claimed rates are described only qualitatively (“matching those of the proximal gradient method up to logarithmic factors”). Stating the precise rates (including the explicit log factors) in the abstract would make the contribution immediately clearer to readers.
  2. [Introduction] The bounded-iterates assumption is flagged as necessary for the non-convex case, but its precise quantitative role in controlling the accumulated denominator is not previewed; a one-sentence remark in the introduction would help.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive assessment of our manuscript. We appreciate the recognition of the uniform analysis across the three problem classes, the new noise-control techniques, and the near-optimal rates achieved by the accelerated variant. The recommendation for minor revision is noted, and we stand ready to incorporate any editorial or minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper defines an adaptive proximal gradient method whose step sizes are formed by accumulating squared gradient-mapping norms in the denominator, then proves convergence rates (matching proximal gradient up to logarithmic factors) for non-convex smooth, convex nonsmooth, and convex smooth objectives under explicitly stated bounded-iterates and bounded-variance assumptions. The analysis introduces new noise-control arguments that apply uniformly across classes and regimes; no equation reduces a claimed rate to a fitted parameter, self-referential definition, or prior self-citation that itself depends on the target result. The bounded-iterates hypothesis is flagged as necessary for the non-convex case and is a standard external condition rather than an ansatz smuggled via citation. The construction and rates are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions stated in the abstract; no free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption bounded-iterates assumption
    Invoked to guarantee convergence across all three problem classes in both deterministic and stochastic settings.
  • domain assumption bounded-variance assumption
    Required for the stochastic setting to control noise effects.

pith-pipeline@v0.9.0 · 5479 in / 1265 out tokens · 60679 ms · 2026-05-08T08:23:22.748827+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 1 canonical work page

  1. A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. In International Conference on Machine Learning, pages 1147–1171. PMLR, 2023.
  2. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  3. J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39(227):357–365, 1944.
  4. E. Candes and B. Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
  5. C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
  6. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  7. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  8. A. Ene, H. Nguyen, and A. Vladu. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7314–7321, 2021.
  9. M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. In Conference on Learning Theory, pages 313–355. PMLR, 2022.
  10. S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  11. S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
  12. S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  13. P. Joulani, A. Raj, A. Gyorgy, and C. Szepesvári. A simpler approach to accelerated optimization: iterative averaging meets optimism. In International Conference on Machine Learning, pages 4984–4993. PMLR, 2020.
  14. A. Kavis, K. Levy, F. Bach, and V. Cevher. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in Neural Information Processing Systems, 32, 2019.
  15. A. Kavis, K. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. In International Conference on Learning Representations, 2022.
  16. D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  17. G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
  18. P. Latafat, A. Themelis, and P. Patrinos. On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms. In 6th Annual Learning for Dynamics & Control Conference, pages 197–208. PMLR, 2024.
  19. P. Latafat, A. Themelis, L. Stella, and P. Patrinos. Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. Mathematical Programming, 213(1):433–471, 2025.
  20. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002.
  21. E. Levitin and B. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.
  22. K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. Advances in Neural Information Processing Systems, 30, 2017.
  23. K. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. Advances in Neural Information Processing Systems, 31, 2018.
  24. P. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
  25. Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. In International Conference on Machine Learning, pages 6702–6712. PMLR, 2020.
  26. Y. Malitsky and K. Mishchenko. Adaptive proximal gradient method for convex optimization. Advances in Neural Information Processing Systems, 37:100670–100697, 2024.
  27. Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
  28. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
  29. Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
  30. Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
  31. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
  32. G. Passty. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383–390, 1979.
  33. S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
  34. A. Rodomanov, X. Jiang, and S. Stich. Universality of AdaGrad stepsizes for stochastic optimization: Inexact oracle, acceleration and variance reduction. Advances in Neural Information Processing Systems, 37:26770–26813, 2024.
  35. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
  36. T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  37. P. Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
  38. B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of AdaGrad for non-convex objectives: Simple proofs and relaxed assumptions. In The Thirty Sixth Annual Conference on Learning Theory, pages 161–190. PMLR, 2023.
  39. J. Wang, X. Wang, and L. Zhang. Stochastic regularized Newton methods for nonlinear equations. Journal of Scientific Computing, 94(3):51, 2023.
  40. X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927–956, 2017.
  41. R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Journal of Machine Learning Research, 21:1–30, 2020.
  42. J. Yun, A. Lozano, and E. Yang. Adaptive proximal gradient methods for structured neural networks. Advances in Neural Information Processing Systems, 34:24365–24378, 2021.
  43. M. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.