pith. sign in

arxiv: 2601.16683 · v2 · submitted 2026-01-23 · 🧮 math.OC

Projected Gradient Methods with Momentum

Pith reviewed 2026-05-16 11:45 UTC · model grok-4.3

classification 🧮 math.OC
keywords projected gradient methodsmomentum accelerationconstrained optimizationnonconvex optimizationconvergence analysisspectral gradient
0
0 comments X

The pith

Momentum can be added to projected gradient methods for smooth nonconvex objectives over convex sets while retaining convergence guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes general algorithmic frameworks that combine projected gradient steps with momentum for smooth possibly nonconvex problems whose convex constraint set admits an efficient Euclidean projection. It identifies theoretically sound ways to insert momentum information and then works out one concrete algorithm that keeps per-iteration cost low. The method is shown to converge and to beat the standard spectral projected gradient method on two numerical benchmarks.

Core claim

A projected gradient algorithm augmented with momentum terms possesses the same convergence and complexity guarantees as its non-momentum counterpart yet produces faster practical progress on the tested constrained problems.

What carries the argument

A tailored projected gradient iteration that incorporates a momentum correction before the projection step, derived from a general convergence framework for momentum-augmented projected methods.

If this is right

  • The added momentum term does not worsen the worst-case complexity bound of projected gradient methods.
  • The same convergence theory covers both convex and nonconvex objectives under the stated smoothness assumption.
  • The per-iteration overhead remains comparable to a single gradient evaluation plus projection.
  • Empirical gains appear in at least two distinct application domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same momentum construction could be tested on other first-order constrained schemes such as conditional gradient methods.
  • If the projection cost dominates, the practical advantage of momentum may shrink relative to the unconstrained case.
  • Extensions to stochastic or inexact gradient oracles would require only minor changes to the existing analysis.

Load-bearing premise

The objective is smooth and the Euclidean projection onto the convex constraint set is available at reasonable cost.

What would settle it

A problem instance on which the momentum version fails to reach the same accuracy as the plain spectral projected gradient method within the stated iteration budget.

Figures

Figures reproduced from arXiv: 2601.16683 by Diego Scuppa, Giampaolo Liuzzi, Marco Sciandrone, Matteo Lapucci, Stefano Lucidi.

Figure 1
Figure 1. Figure 1: Performance profiles for the comparison between SPG and PGMM on the 100 LR instances with ℓ1-ball constraints. (a) CPU Time = 2 4 6 8 10 ; a ( = ) 0 0.2 0.4 0.6 0.8 1 SPG PGMM (b) Iterations = 2 4 6 8 10 ; a ( = ) 0 0.2 0.4 0.6 0.8 1 SPG PGMM (c) Function evaluations = 2 4 6 8 10 ; a ( = ) 0 0.2 0.4 0.6 0.8 1 SPG PGMM [PITH_FULL_IMAGE:figures/full_fig_p021_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Feasible set of the bidimensional problem in [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗
read the original abstract

We focus on the optimization problem with smooth, possibly nonconvex objectives and a convex constraint set for which the Euclidean projection operation is practically available. Focusing on this setting, we carry out a general convergence and complexity analysis for algorithmic frameworks. Consequently, we discuss theoretically sound strategies to integrate momentum information within classical projected gradient type algorithms. One of these approaches is then developed in detail, up to the definition of a tailored algorithm with both theoretical guarantees and reasonable per-iteration cost. The proposed method is finally shown to outperform the standard (spectral) projected gradient method in two different experimental benchmarks, indicating that the addition of momentum terms is as beneficial in the constrained setting as it is in the unconstrained scenario.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper addresses optimization of smooth (possibly nonconvex) objectives over convex constraint sets where the Euclidean projection is available. It develops a general convergence and complexity analysis for projected-gradient frameworks, identifies theoretically sound ways to incorporate momentum, fully specifies one such algorithm with per-iteration cost comparable to the baseline, proves convergence guarantees for it, and reports that the method outperforms the spectral projected-gradient baseline on two numerical benchmarks.

Significance. If the stated rates and empirical gains hold, the work supplies a practical, theoretically justified way to accelerate projected-gradient methods in the constrained setting, extending the well-known benefits of momentum from the unconstrained case. The explicit complexity bounds and the fact that the algorithm retains the same per-iteration cost as the baseline are concrete strengths.

major comments (2)
  1. [§3.2, Theorem 3.4] §3.2, Theorem 3.4: the descent inequality used to obtain the O(1/T) rate for the nonconvex case absorbs the momentum term only under the assumption that the momentum coefficient β satisfies β ≤ 1/(2L); this restriction is not stated in the theorem statement itself and appears to be needed for the final bound, so the claimed generality of the rate should be clarified.
  2. [§4.1, Algorithm 1] §4.1, Algorithm 1: the per-iteration cost claim relies on the projection being computed exactly once per step, yet the momentum update introduces an extra vector addition whose numerical stability is not analyzed; when the constraint set is a simplex or a box, this addition can accumulate floating-point error that the convergence proof does not account for.
minor comments (2)
  1. [§5] The abstract states that the method 'outperforms' the baseline in two benchmarks, but §5 does not report the number of random seeds or statistical significance tests; adding these would strengthen the empirical claim.
  2. [§2 and §4.3] Notation for the momentum coefficient is introduced as β in §2 and then reused as γ in the proof of Lemma 4.3; a single consistent symbol would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the precise comments, which help strengthen the presentation of our results. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§3.2, Theorem 3.4] §3.2, Theorem 3.4: the descent inequality used to obtain the O(1/T) rate for the nonconvex case absorbs the momentum term only under the assumption that the momentum coefficient β satisfies β ≤ 1/(2L); this restriction is not stated in the theorem statement itself and appears to be needed for the final bound, so the claimed generality of the rate should be clarified.

    Authors: We agree that the condition β ≤ 1/(2L) is required for the descent inequality to absorb the momentum contribution and obtain the stated O(1/T) rate in the nonconvex setting. This bound on β was used in the proof but omitted from the theorem statement. We will revise Theorem 3.4 to state the restriction explicitly. revision: yes

  2. Referee: [§4.1, Algorithm 1] §4.1, Algorithm 1: the per-iteration cost claim relies on the projection being computed exactly once per step, yet the momentum update introduces an extra vector addition whose numerical stability is not analyzed; when the constraint set is a simplex or a box, this addition can accumulate floating-point error that the convergence proof does not account for.

    Authors: The momentum step consists of one vector addition and scaling whose arithmetic cost is negligible compared with the projection. The per-iteration complexity claim therefore remains unchanged. Our convergence analysis is performed in exact arithmetic, the standard setting for such first-order guarantees. A detailed floating-point analysis lies outside the scope of the paper; we will add a short remark in Section 4.1 recording the exact-arithmetic assumption and noting that the reported experiments on simplex and box constraints exhibit no observable instability. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper performs a standard convergence and complexity analysis for momentum-augmented projected gradient methods under the assumptions of smooth (possibly nonconvex) objectives and convex constraint sets with available Euclidean projection. Theoretical guarantees follow from classical smoothness and descent arguments without reducing any prediction or result to a fitted parameter or self-referential definition. The tailored algorithm and its per-iteration cost are derived directly from the framework, and experimental outperformance on benchmarks is presented as independent empirical evidence rather than a constructed outcome. No load-bearing self-citations, imported uniqueness theorems, or ansatzes that circularly define the central claims appear in the abstract or described structure; the derivation remains self-contained against external optimization benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on standard domain assumptions for constrained optimization; momentum parameters are design choices rather than fitted values derived from data.

free parameters (1)
  • momentum coefficient
    A tunable parameter in the momentum update rule, selected as part of algorithm design or empirical tuning.
axioms (1)
  • domain assumption The objective function is smooth and the constraint set is convex with a practically available Euclidean projection.
    Explicitly stated as the focus of the problem setting in the abstract.

pith-pipeline@v0.9.0 · 5414 in / 1134 out tokens · 46049 ms · 2026-05-16T11:45:15.924131+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

  1. [1]

    IMA Journal of Nu- merical Analysis25(2), 221–252 (2005)

    Andreani, R., Birgin, E.G., Mart´ ınez, J.M., Yuan, J.: Spectral projected gradient and variable metric methods for optimization with linear inequalities. IMA Journal of Nu- merical Analysis25(2), 221–252 (2005)

  2. [2]

    Numerical Algorithms 53(1), 23–52 (2010)

    Andretta, M., Birgin, E.G., Mart´ ınez, J.M.: Partial spectral projected gradient method with active-set strategy for linearly constrained optimization. Numerical Algorithms 53(1), 23–52 (2010)

  3. [3]

    Available: https://doi.org/10.1093/imanum/8.1.141

    Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA Journal of Numerical Analysis8, 141–148 (1988). DOI 10.1093/imanum/8.1.141

  4. [4]

    Athena Scientific (1999)

    Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (1999)

  5. [5]

    Computational Optimization and Applica- tions23(1), 101–125 (2002) Projected gradient methods with momentum 23

    Birgin, E.G., Mario Mart´ ınez, J.: Large-scale active-set box-constrained optimization method with spectral projected gradients. Computational Optimization and Applica- tions23(1), 101–125 (2002) Projected gradient methods with momentum 23

  6. [6]

    Nonmonotone spectral projected gradient methods on convex sets

    Birgin, E.G., Mart´ ınez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization10(4), 1196–1211 (2000). DOI 10.1137/S1052623497330963. URLhttps://doi.org/10.1137/S1052623497330963

  7. [7]

    ACM Trans

    Birgin, E.G., Mart´ ınez, J.M., Raydan, M.: Algorithm 813: Spg—software for convex- constrained optimization. ACM Trans. Math. Softw.27(3), 340–349 (2001). DOI 10.1145/502800.502803. URLhttps://doi.org/10.1145/502800.502803

  8. [8]

    IMA Journal of Numerical Analysis23(4), 539–559 (2003)

    Birgin, E.G., Mart´ ınez, J.M., Raydan, M.: Inexact spectral projected gradient methods on convex sets. IMA Journal of Numerical Analysis23(4), 539–559 (2003)

  9. [9]

    Inverse Problems31(9), 095,008 (2015)

    Bonettini, S., Prato, M.: New convergence results for the scaled gradient projection method. Inverse Problems31(9), 095,008 (2015)

  10. [10]

    Inverse Problems25(1), 015,002 (2008)

    Bonettini, S., Zanella, R., Zanni, L.: A scaled gradient projection method for constrained image deblurring. Inverse Problems25(1), 015,002 (2008)

  11. [11]

    SIAM Review60(2), 223–311 (2018)

    Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Review60(2), 223–311 (2018)

  12. [12]

    SIAM (2022)

    Cartis, C., Gould, N.I., Toint, P.L.: Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM (2022)

  13. [13]

    Mathematical Program- ming158(1), 575–585 (2016)

    Condat, L.: Fast projection onto the simplex and the l1 ball. Mathematical Program- ming158(1), 575–585 (2016)

  14. [14]

    Computational Optimization and Applications83(2), 693–721 (2022)

    Cristofari, A., De Santis, M., Lucidi, S., Rinaldi, F.: Minimization over theℓ 1-ball using an active-set non-monotone projected gradient. Computational Optimization and Applications83(2), 693–721 (2022)

  15. [15]

    Mathematical Programming91(2), 201–213 (2002)

    Dolan, E.D., Mor´ e, J.J.: Benchmarking optimization software with performance profiles. Mathematical Programming91(2), 201–213 (2002)

  16. [16]

    Computational Optimization and Applications81(1), 91–125 (2022)

    Ferreira, O.P., Lemes, M., Prudente, L.F.: On the inexact scaled gradient projection method. Computational Optimization and Applications81(1), 91–125 (2022)

  17. [17]

    IEEE Transactions on Information Theory64(10), 6707–6721 (2018)

    Golbabaee, M., Davies, M.E.: Inexact gradient projection and fast data driven com- pressed sensing. IEEE Transactions on Information Theory64(10), 6707–6721 (2018)

  18. [18]

    Optimization71(1), 145–163 (2022)

    Gon¸ calves, D.S., Gon¸ calves, M.L., Menezes, T.C.: Inexact variable metric method for convex-constrained optimization problems. Optimization71(1), 145–163 (2022)

  19. [19]

    Computational Opti- mization and Applications60(2014)

    Gould, N., Orban, D., Toint, P.: CUTEst: a Constrained and Unconstrained Testing Environment with safe threads for mathematical optimization. Computational Opti- mization and Applications60(2014). DOI 10.1007/s10589-014-9687-3

  20. [20]

    Grippo, F

    Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for new- ton’s method. SIAM Journal on Numerical Analysis23(4), 707–716 (1986). DOI 10.1137/0723046. URLhttps://doi.org/10.1137/0723046

  21. [21]

    Uni- text

    Grippo, L., Sciandrone, M.: Introduction to methods for nonlinear optimization. Uni- text. Springer, Cham (2023)

  22. [22]

    Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. (2009)

  23. [23]

    Mathematical Pro- gramming199(1), 1365–1415 (2023)

    Jia, X., Kanzow, C., Mehlitz, P., Wachsmuth, G.: An augmented Lagrangian method for optimization problems with structured geometric constraints. Mathematical Pro- gramming199(1), 1365–1415 (2023)

  24. [24]

    Computational Optimization and Applications pp

    Lapucci, M., Liuzzi, G., Lucidi, S., Pucci, D., Sciandrone, M.: A globally convergent gradient method with momentum. Computational Optimization and Applications pp. 1–26 (2025)

  25. [25]

    Mathe- matical Programming Computation14(3), 543–591 (2022)

    Lee, C.p., Wang, P.W., Lin, C.J.: Limited-memory common-directions method for large- scale optimization: convergence, parallelization, and distributed optimization. Mathe- matical Programming Computation14(3), 543–591 (2022)

  26. [26]

    In: AAAI, vol

    Lee, S.I., Lee, H., Abbeel, P., Ng, A.Y.: Efficient L1 regularized logistic regression. In: AAAI, vol. 6, pp. 401–408 (2006)

  27. [27]

    USSR Computational Mathematics and Mathematical Physics6(5), 1–50 (1966)

    Levitin, E.S., Polyak, B.T.: Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics6(5), 1–50 (1966)

  28. [28]

    In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp

    Liu, J., Chen, J., Ye, J.: Large-scale sparse logistic regression. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 547–556 (2009)

  29. [29]

    doi: 10.1016/0041-5553(64)90137-5

    Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics4(5), 1–17 (1964). DOI https://doi.org/10.1016/0041-5553(64)90137-5. URLhttps://www.sciencedirect. com/science/article/pii/0041555364901375 24 Lapucci et al

  30. [30]

    SIAM Journal on Optimization7(1), 26–33 (1997)

    Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization7(1), 26–33 (1997). DOI 10. 1137/S1052623494266365. URLhttps://doi.org/10.1137/S1052623494266365

  31. [31]

    SIAM Journal on Scientific Computing46(3), A2025–A2046 (2024)

    Tang, T., Toh, K.C., Xiao, N., Ye, Y.: A Riemannian dimension-reduced second-order method with application in sensor network localization. SIAM Journal on Scientific Computing46(3), A2025–A2046 (2024). DOI 10.1137/23M1567229. URLhttps:// doi.org/10.1137/23M1567229

  32. [32]

    IEEE Transactions on Neural Networks and Learning Systems33(3), 1107–1118 (2022)

    Tao, W., Wu, G.W., Tao, Q.: Momentum acceleration in the individual convergence of nonsmooth convex optimization with constraints. IEEE Transactions on Neural Networks and Learning Systems33(3), 1107–1118 (2022)

  33. [33]

    Wright, S.J., Recht, B.: Optimization for Data Analysis. Cambridge University Press (2022) Appendix: Closed form solution for the problem of minimizing a bidimensional quadratic function over a polytope We consider the two-dimensional optimization problem min α,β φ(α, β) = 1 2 α β T t u u w α β + y h T α β s.t.α≥0, β≥0, α+β≤1. (29) The feasible set of the...