Projected Gradient Methods with Momentum
Pith reviewed 2026-05-16 11:45 UTC · model grok-4.3
The pith
Momentum can be added to projected gradient methods for smooth nonconvex objectives over convex sets while retaining convergence guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A projected gradient algorithm augmented with momentum terms possesses the same convergence and complexity guarantees as its non-momentum counterpart yet produces faster practical progress on the tested constrained problems.
What carries the argument
A tailored projected gradient iteration that incorporates a momentum correction before the projection step, derived from a general convergence framework for momentum-augmented projected methods.
If this is right
- The added momentum term does not worsen the worst-case complexity bound of projected gradient methods.
- The same convergence theory covers both convex and nonconvex objectives under the stated smoothness assumption.
- The per-iteration overhead remains comparable to a single gradient evaluation plus projection.
- Empirical gains appear in at least two distinct application domains.
Where Pith is reading between the lines
- The same momentum construction could be tested on other first-order constrained schemes such as conditional gradient methods.
- If the projection cost dominates, the practical advantage of momentum may shrink relative to the unconstrained case.
- Extensions to stochastic or inexact gradient oracles would require only minor changes to the existing analysis.
Load-bearing premise
The objective is smooth and the Euclidean projection onto the convex constraint set is available at reasonable cost.
What would settle it
A problem instance on which the momentum version fails to reach the same accuracy as the plain spectral projected gradient method within the stated iteration budget.
Figures
read the original abstract
We focus on the optimization problem with smooth, possibly nonconvex objectives and a convex constraint set for which the Euclidean projection operation is practically available. Focusing on this setting, we carry out a general convergence and complexity analysis for algorithmic frameworks. Consequently, we discuss theoretically sound strategies to integrate momentum information within classical projected gradient type algorithms. One of these approaches is then developed in detail, up to the definition of a tailored algorithm with both theoretical guarantees and reasonable per-iteration cost. The proposed method is finally shown to outperform the standard (spectral) projected gradient method in two different experimental benchmarks, indicating that the addition of momentum terms is as beneficial in the constrained setting as it is in the unconstrained scenario.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses optimization of smooth (possibly nonconvex) objectives over convex constraint sets where the Euclidean projection is available. It develops a general convergence and complexity analysis for projected-gradient frameworks, identifies theoretically sound ways to incorporate momentum, fully specifies one such algorithm with per-iteration cost comparable to the baseline, proves convergence guarantees for it, and reports that the method outperforms the spectral projected-gradient baseline on two numerical benchmarks.
Significance. If the stated rates and empirical gains hold, the work supplies a practical, theoretically justified way to accelerate projected-gradient methods in the constrained setting, extending the well-known benefits of momentum from the unconstrained case. The explicit complexity bounds and the fact that the algorithm retains the same per-iteration cost as the baseline are concrete strengths.
major comments (2)
- [§3.2, Theorem 3.4] §3.2, Theorem 3.4: the descent inequality used to obtain the O(1/T) rate for the nonconvex case absorbs the momentum term only under the assumption that the momentum coefficient β satisfies β ≤ 1/(2L); this restriction is not stated in the theorem statement itself and appears to be needed for the final bound, so the claimed generality of the rate should be clarified.
- [§4.1, Algorithm 1] §4.1, Algorithm 1: the per-iteration cost claim relies on the projection being computed exactly once per step, yet the momentum update introduces an extra vector addition whose numerical stability is not analyzed; when the constraint set is a simplex or a box, this addition can accumulate floating-point error that the convergence proof does not account for.
minor comments (2)
- [§5] The abstract states that the method 'outperforms' the baseline in two benchmarks, but §5 does not report the number of random seeds or statistical significance tests; adding these would strengthen the empirical claim.
- [§2 and §4.3] Notation for the momentum coefficient is introduced as β in §2 and then reused as γ in the proof of Lemma 4.3; a single consistent symbol would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the precise comments, which help strengthen the presentation of our results. We respond to each major comment below.
read point-by-point responses
-
Referee: [§3.2, Theorem 3.4] §3.2, Theorem 3.4: the descent inequality used to obtain the O(1/T) rate for the nonconvex case absorbs the momentum term only under the assumption that the momentum coefficient β satisfies β ≤ 1/(2L); this restriction is not stated in the theorem statement itself and appears to be needed for the final bound, so the claimed generality of the rate should be clarified.
Authors: We agree that the condition β ≤ 1/(2L) is required for the descent inequality to absorb the momentum contribution and obtain the stated O(1/T) rate in the nonconvex setting. This bound on β was used in the proof but omitted from the theorem statement. We will revise Theorem 3.4 to state the restriction explicitly. revision: yes
-
Referee: [§4.1, Algorithm 1] §4.1, Algorithm 1: the per-iteration cost claim relies on the projection being computed exactly once per step, yet the momentum update introduces an extra vector addition whose numerical stability is not analyzed; when the constraint set is a simplex or a box, this addition can accumulate floating-point error that the convergence proof does not account for.
Authors: The momentum step consists of one vector addition and scaling whose arithmetic cost is negligible compared with the projection. The per-iteration complexity claim therefore remains unchanged. Our convergence analysis is performed in exact arithmetic, the standard setting for such first-order guarantees. A detailed floating-point analysis lies outside the scope of the paper; we will add a short remark in Section 4.1 recording the exact-arithmetic assumption and noting that the reported experiments on simplex and box constraints exhibit no observable instability. revision: partial
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper performs a standard convergence and complexity analysis for momentum-augmented projected gradient methods under the assumptions of smooth (possibly nonconvex) objectives and convex constraint sets with available Euclidean projection. Theoretical guarantees follow from classical smoothness and descent arguments without reducing any prediction or result to a fitted parameter or self-referential definition. The tailored algorithm and its per-iteration cost are derived directly from the framework, and experimental outperformance on benchmarks is presented as independent empirical evidence rather than a constructed outcome. No load-bearing self-citations, imported uniqueness theorems, or ansatzes that circularly define the central claims appear in the abstract or described structure; the derivation remains self-contained against external optimization benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- momentum coefficient
axioms (1)
- domain assumption The objective function is smooth and the constraint set is convex with a practically available Euclidean projection.
Reference graph
Works this paper leans on
-
[1]
IMA Journal of Nu- merical Analysis25(2), 221–252 (2005)
Andreani, R., Birgin, E.G., Mart´ ınez, J.M., Yuan, J.: Spectral projected gradient and variable metric methods for optimization with linear inequalities. IMA Journal of Nu- merical Analysis25(2), 221–252 (2005)
work page 2005
-
[2]
Numerical Algorithms 53(1), 23–52 (2010)
Andretta, M., Birgin, E.G., Mart´ ınez, J.M.: Partial spectral projected gradient method with active-set strategy for linearly constrained optimization. Numerical Algorithms 53(1), 23–52 (2010)
work page 2010
-
[3]
Available: https://doi.org/10.1093/imanum/8.1.141
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA Journal of Numerical Analysis8, 141–148 (1988). DOI 10.1093/imanum/8.1.141
-
[4]
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (1999)
work page 1999
-
[5]
Birgin, E.G., Mario Mart´ ınez, J.: Large-scale active-set box-constrained optimization method with spectral projected gradients. Computational Optimization and Applica- tions23(1), 101–125 (2002) Projected gradient methods with momentum 23
work page 2002
-
[6]
Nonmonotone spectral projected gradient methods on convex sets
Birgin, E.G., Mart´ ınez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization10(4), 1196–1211 (2000). DOI 10.1137/S1052623497330963. URLhttps://doi.org/10.1137/S1052623497330963
-
[7]
Birgin, E.G., Mart´ ınez, J.M., Raydan, M.: Algorithm 813: Spg—software for convex- constrained optimization. ACM Trans. Math. Softw.27(3), 340–349 (2001). DOI 10.1145/502800.502803. URLhttps://doi.org/10.1145/502800.502803
-
[8]
IMA Journal of Numerical Analysis23(4), 539–559 (2003)
Birgin, E.G., Mart´ ınez, J.M., Raydan, M.: Inexact spectral projected gradient methods on convex sets. IMA Journal of Numerical Analysis23(4), 539–559 (2003)
work page 2003
-
[9]
Inverse Problems31(9), 095,008 (2015)
Bonettini, S., Prato, M.: New convergence results for the scaled gradient projection method. Inverse Problems31(9), 095,008 (2015)
work page 2015
-
[10]
Inverse Problems25(1), 015,002 (2008)
Bonettini, S., Zanella, R., Zanni, L.: A scaled gradient projection method for constrained image deblurring. Inverse Problems25(1), 015,002 (2008)
work page 2008
-
[11]
SIAM Review60(2), 223–311 (2018)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Review60(2), 223–311 (2018)
work page 2018
-
[12]
Cartis, C., Gould, N.I., Toint, P.L.: Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM (2022)
work page 2022
-
[13]
Mathematical Program- ming158(1), 575–585 (2016)
Condat, L.: Fast projection onto the simplex and the l1 ball. Mathematical Program- ming158(1), 575–585 (2016)
work page 2016
-
[14]
Computational Optimization and Applications83(2), 693–721 (2022)
Cristofari, A., De Santis, M., Lucidi, S., Rinaldi, F.: Minimization over theℓ 1-ball using an active-set non-monotone projected gradient. Computational Optimization and Applications83(2), 693–721 (2022)
work page 2022
-
[15]
Mathematical Programming91(2), 201–213 (2002)
Dolan, E.D., Mor´ e, J.J.: Benchmarking optimization software with performance profiles. Mathematical Programming91(2), 201–213 (2002)
work page 2002
-
[16]
Computational Optimization and Applications81(1), 91–125 (2022)
Ferreira, O.P., Lemes, M., Prudente, L.F.: On the inexact scaled gradient projection method. Computational Optimization and Applications81(1), 91–125 (2022)
work page 2022
-
[17]
IEEE Transactions on Information Theory64(10), 6707–6721 (2018)
Golbabaee, M., Davies, M.E.: Inexact gradient projection and fast data driven com- pressed sensing. IEEE Transactions on Information Theory64(10), 6707–6721 (2018)
work page 2018
-
[18]
Optimization71(1), 145–163 (2022)
Gon¸ calves, D.S., Gon¸ calves, M.L., Menezes, T.C.: Inexact variable metric method for convex-constrained optimization problems. Optimization71(1), 145–163 (2022)
work page 2022
-
[19]
Computational Opti- mization and Applications60(2014)
Gould, N., Orban, D., Toint, P.: CUTEst: a Constrained and Unconstrained Testing Environment with safe threads for mathematical optimization. Computational Opti- mization and Applications60(2014). DOI 10.1007/s10589-014-9687-3
-
[20]
Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for new- ton’s method. SIAM Journal on Numerical Analysis23(4), 707–716 (1986). DOI 10.1137/0723046. URLhttps://doi.org/10.1137/0723046
- [21]
-
[22]
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. (2009)
work page 2009
-
[23]
Mathematical Pro- gramming199(1), 1365–1415 (2023)
Jia, X., Kanzow, C., Mehlitz, P., Wachsmuth, G.: An augmented Lagrangian method for optimization problems with structured geometric constraints. Mathematical Pro- gramming199(1), 1365–1415 (2023)
work page 2023
-
[24]
Computational Optimization and Applications pp
Lapucci, M., Liuzzi, G., Lucidi, S., Pucci, D., Sciandrone, M.: A globally convergent gradient method with momentum. Computational Optimization and Applications pp. 1–26 (2025)
work page 2025
-
[25]
Mathe- matical Programming Computation14(3), 543–591 (2022)
Lee, C.p., Wang, P.W., Lin, C.J.: Limited-memory common-directions method for large- scale optimization: convergence, parallelization, and distributed optimization. Mathe- matical Programming Computation14(3), 543–591 (2022)
work page 2022
-
[26]
Lee, S.I., Lee, H., Abbeel, P., Ng, A.Y.: Efficient L1 regularized logistic regression. In: AAAI, vol. 6, pp. 401–408 (2006)
work page 2006
-
[27]
USSR Computational Mathematics and Mathematical Physics6(5), 1–50 (1966)
Levitin, E.S., Polyak, B.T.: Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics6(5), 1–50 (1966)
work page 1966
-
[28]
Liu, J., Chen, J., Ye, J.: Large-scale sparse logistic regression. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 547–556 (2009)
work page 2009
-
[29]
doi: 10.1016/0041-5553(64)90137-5
Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics4(5), 1–17 (1964). DOI https://doi.org/10.1016/0041-5553(64)90137-5. URLhttps://www.sciencedirect. com/science/article/pii/0041555364901375 24 Lapucci et al
-
[30]
SIAM Journal on Optimization7(1), 26–33 (1997)
Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization7(1), 26–33 (1997). DOI 10. 1137/S1052623494266365. URLhttps://doi.org/10.1137/S1052623494266365
-
[31]
SIAM Journal on Scientific Computing46(3), A2025–A2046 (2024)
Tang, T., Toh, K.C., Xiao, N., Ye, Y.: A Riemannian dimension-reduced second-order method with application in sensor network localization. SIAM Journal on Scientific Computing46(3), A2025–A2046 (2024). DOI 10.1137/23M1567229. URLhttps:// doi.org/10.1137/23M1567229
-
[32]
IEEE Transactions on Neural Networks and Learning Systems33(3), 1107–1118 (2022)
Tao, W., Wu, G.W., Tao, Q.: Momentum acceleration in the individual convergence of nonsmooth convex optimization with constraints. IEEE Transactions on Neural Networks and Learning Systems33(3), 1107–1118 (2022)
work page 2022
-
[33]
Wright, S.J., Recht, B.: Optimization for Data Analysis. Cambridge University Press (2022) Appendix: Closed form solution for the problem of minimizing a bidimensional quadratic function over a polytope We consider the two-dimensional optimization problem min α,β φ(α, β) = 1 2 α β T t u u w α β + y h T α β s.t.α≥0, β≥0, α+β≤1. (29) The feasible set of the...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.