A Momentum-based Stochastic Algorithm for Linearly Constrained Nonconvex Optimization
Pith reviewed 2026-05-10 14:34 UTC · model grok-4.3
The pith
A momentum-based augmented Lagrangian method finds an ε-stationary solution for linearly constrained nonconvex optimization in O(ε^{-4}) stochastic gradient evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed momentum-based augmented Lagrangian method employs a Polyak-type gradient estimator and requires only one stochastic gradient evaluation per iteration. Under the standard stochastic oracle model and the smoothness condition of the expected objective, the paper establishes a convergence guarantee in terms of the first-order KKT residual of the original constrained problem. In particular, the method computes an ε-stationary solution in expectation within O(ε^{-4}) stochastic gradient evaluations.
What carries the argument
Momentum-based augmented Lagrangian method with a Polyak-type gradient estimator that performs one stochastic gradient evaluation per iteration while updating primal and dual variables.
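The update structure described above can be sketched concretely. The following is a minimal illustrative instantiation for min f(x) subject to Ax = b, not the paper's exact scheme: the step sizes, momentum weight, penalty parameter, and the toy problem are all assumptions made for demonstration. The key structural features it reproduces are the single stochastic gradient call per iteration and the Polyak-type (exponential moving average) momentum estimator.

```python
import numpy as np

def momentum_al_step(x, lam, d, stoch_grad, A, b,
                     eta=0.01, sigma=0.01, rho=1.0, alpha=0.1):
    """One iteration: one stochastic gradient evaluation, a Polyak-type
    momentum update of the estimator d, a primal descent step on the
    augmented Lagrangian, and a dual ascent step. Illustrative only."""
    g = stoch_grad(x)                        # the only oracle call this iteration
    d = (1 - alpha) * d + alpha * g          # Polyak-type momentum estimator
    r = A @ x - b                            # constraint residual
    grad_L = d + A.T @ lam + rho * (A.T @ r) # gradient of the augmented Lagrangian
    x = x - eta * grad_L                     # primal descent step
    lam = lam + sigma * (A @ x - b)          # dual ascent step
    return x, lam, d

# Toy problem: min 0.5 ||x||^2  s.t.  sum(x) = 1, with noisy gradients.
rng = np.random.default_rng(0)
A = np.ones((1, 3)); b = np.array([1.0])
stoch_grad = lambda x: x + 0.01 * rng.standard_normal(x.shape)
x, lam, d = np.zeros(3), np.zeros(1), np.zeros(3)
for _ in range(5000):
    x, lam, d = momentum_al_step(x, lam, d, stoch_grad, A, b)
# x should approach the constrained minimizer [1/3, 1/3, 1/3]
```

On this toy problem the primal-dual fixed point is x* = (1/3, 1/3, 1/3) with multiplier λ* = −1/3, so the iterates give a quick sanity check of the update's structure.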
If this is right
- The algorithm matches the iteration complexity of representative recursive-momentum methods while using fewer gradient evaluations per step.
- Numerical tests demonstrate competitive iteration counts and reduced wall-clock time relative to the baselines.
- The convergence guarantee applies directly to the KKT residual of the original linearly constrained problem.
- The single-gradient-per-iteration design preserves efficiency under the standard bounded-variance stochastic model.
Where Pith is reading between the lines
- The rate suggests that linear constraints do not asymptotically worsen the complexity relative to the unconstrained stochastic nonconvex setting.
- The one-evaluation structure could support larger-scale applications where each gradient computation is costly.
- Similar momentum techniques might be tested on problems whose constraints are only approximately linear.
Load-bearing premise
The expected objective is smooth and the stochastic gradients are unbiased with bounded variance.
What would settle it
A falsifying experiment: on a smooth test problem satisfying the stochastic oracle assumptions, the number of stochastic gradient evaluations needed to reach ε-stationarity in expectation grows faster than ε^{-4} as ε decreases.
Original abstract
This paper studies a stochastic algorithm for linearly constrained nonconvex optimization, where the objective function is smooth but only unbiased stochastic gradients with bounded variance are available. We propose a momentum-based augmented Lagrangian method that employs a Polyak-type gradient estimator and requires only one stochastic gradient evaluation per iteration. Under the standard stochastic oracle model and the smoothness condition of the expected objective, we establish a convergence guarantee in terms of the first-order KKT residual of the original constrained problem. In particular, the proposed method computes an $\epsilon$-stationary solution in expectation within $O(\epsilon^{-4})$ stochastic gradient evaluations. Numerical experiments further show that the proposed method achieves competitive iteration complexity and improved wall-clock efficiency compared with representative recursive-momentum baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a momentum-based augmented Lagrangian method with a Polyak-type gradient estimator for linearly constrained nonconvex optimization problems, where only unbiased stochastic gradients with bounded variance are available. The method requires one stochastic gradient evaluation per iteration and establishes a convergence guarantee showing that an ε-stationary solution (in terms of the first-order KKT residual) is obtained in expectation after O(ε^{-4}) stochastic gradient evaluations, under the standard stochastic oracle model and smoothness of the expected objective. Numerical experiments compare the method to recursive-momentum baselines.
Significance. If the result holds, the work provides a practical single-oracle-per-iteration algorithm for a challenging class of problems, with a Lyapunov-based analysis that combines the augmented Lagrangian, constraint violation, and momentum terms to derive the stated complexity. The exact handling of linear constraints via the augmented term and the preservation of the single-gradient property are strengths.
Minor comments (2)
- [Numerical Experiments] The numerical experiments section summarizes performance but does not include quantitative tables, specific iteration counts, wall-clock times, or statistical details (e.g., means and variances over multiple runs), which weakens the ability to verify the claimed competitive iteration complexity and improved efficiency.
- [Main Results] Notation for the KKT residual and stationarity measure could be clarified with an explicit definition or reference to the precise quantity whose expectation is bounded by ε (e.g., in the statement of the main theorem).
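The referee's point about notation can be illustrated with one common choice of stationarity measure for min f(x) subject to Ax = b, sketched below. This is a standard definition assumed here for concreteness; the paper's exact measure may differ.

```python
import numpy as np

def kkt_residual(grad_f_x, x, lam, A, b):
    """A common epsilon-stationarity measure for min f(x) s.t. Ax = b:
    (x, lam) is epsilon-stationary when both the dual-feasibility norm
    ||grad f(x) + A^T lam|| and the primal-feasibility norm ||Ax - b||
    are at most epsilon (in expectation, under a stochastic oracle)."""
    dual = np.linalg.norm(grad_f_x + A.T @ lam)
    primal = np.linalg.norm(A @ x - b)
    return max(dual, primal)

# At the exact solution of min 0.5 ||x||^2 s.t. sum(x) = 1, namely
# x* = [1/3, 1/3, 1/3] with lam* = [-1/3], the residual vanishes.
A = np.ones((1, 3)); b = np.array([1.0])
x_star = np.full(3, 1.0 / 3.0); lam_star = np.array([-1.0 / 3.0])
res = kkt_residual(x_star, x_star, lam_star, A, b)  # grad f(x) = x here
```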
Simulated Author's Rebuttal
We thank the referee for the careful reading of our manuscript and the positive assessment. The referee's summary accurately reflects the contributions of our momentum-based augmented Lagrangian method, including the single stochastic gradient evaluation per iteration and the O(ε^{-4}) complexity result under standard assumptions. Since the report raises no major objections, we have no points requiring rebuttal; in revision we will address the two minor comments by adding quantitative experimental details (iteration counts, wall-clock times, and statistics over multiple runs) and an explicit definition of the KKT residual whose expectation is bounded in the main theorem.
Circularity Check
No significant circularity; derivation self-contained under standard assumptions
full rationale
The paper's central O(ε^{-4}) bound on expected KKT residual is obtained from a Lyapunov function combining augmented Lagrangian value, constraint violation, and momentum deviation. A one-step expected decrease inequality is derived from the Polyak gradient estimator and smoothness, then combined with standard variance bounding and telescoping summation over iterations. This follows conventional stochastic nonconvex analysis without any self-definitional reduction, fitted inputs renamed as predictions, or load-bearing self-citations; the linear constraint handling is exact via the augmented term and does not presuppose the target rate. The result is externally falsifiable against the stated stochastic oracle model.
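The telescoping argument described above follows a standard pattern in stochastic nonconvex analysis. A schematic rendering, with generic constants c_1, …, c_4 and a generic stationarity measure G_k standing in for the paper's exact quantities:

```latex
% Schematic Lyapunov function combining the three terms named above:
%   V_k = \mathcal{L}_\rho(x_k,\lambda_k) + c_1\|Ax_k - b\|^2
%         + c_2\|d_k - \nabla f(x_k)\|^2 .
% One-step expected decrease (from the Polyak estimator and smoothness):
\mathbb{E}[V_{k+1}] \;\le\; \mathbb{E}[V_k]
  \;-\; c_3\,\eta\,\mathbb{E}\!\left[G_k^2\right]
  \;+\; c_4\,\sigma^2\eta^2 .
% Telescoping over k = 0,\dots,K-1 and dividing by c_3\,\eta\,K:
\min_{k<K}\,\mathbb{E}\!\left[G_k^2\right]
  \;\le\; \frac{V_0 - \inf V}{c_3\,\eta\,K} \;+\; \frac{c_4\,\sigma^2\eta}{c_3} .
% Choosing \eta \asymp K^{-1/2} balances both terms at O(K^{-1/2}), so by
% Jensen \min_k \mathbb{E}[G_k] = O(K^{-1/4}); setting this equal to
% \varepsilon gives K = O(\varepsilon^{-4}) gradient evaluations.
```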
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: smoothness of the expected objective function
- domain assumption: unbiased stochastic gradients with bounded variance
Reference graph
Works this paper leans on
- [1] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
- [2] B. R. Gaines, J. Kim, and H. Zhou, "Algorithms for fitting the constrained lasso," Journal of Computational and Graphical Statistics, vol. 27, no. 4, pp. 861–871, 2018.
- [3] B. O'Donoghue, G. Stathopoulos, and S. Boyd, "A splitting method for optimal control," IEEE Transactions on Control Systems Technology, vol. 21, no. 6, pp. 2432–2442, 2013.
- [4] F. Lin, M. Fardad, and M. R. Jovanović, "Design of optimal sparse feedback gains via the alternating direction method of multipliers," IEEE Transactions on Automatic Control, vol. 58, no. 9, pp. 2426–2431, 2013.
- [5] O. Zichong, C. Qiu, D. Wang, and L. Jie, "A mixing-accelerated primal-dual proximal algorithm for distributed nonconvex optimization," in 2024 American Control Conference (ACC), IEEE, 2024, pp. 167–172.
- [6] C. Qiu and Z. Lin, "Non-ergodic convergence algorithms for distributed consensus and coupling-constrained optimization," arXiv preprint arXiv:2511.19714, 2025.
- [7] A. Cutkosky and F. Orabona, "Momentum-based variance reduction in non-convex SGD," Advances in Neural Information Processing Systems, vol. 32, 2019.
- [8] M. Hong, "Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications," arXiv preprint arXiv:1604.00543, 2016.
- [9] J. Zhang and Z.-Q. Luo, "A global dual error bound and its application to the analysis of linearly constrained nonconvex optimization," SIAM Journal on Optimization, vol. 30, no. 4, pp. 3345–3375, 2020.
- [10] Y. Wang, W. Yin, and J. Zeng, "Global convergence of ADMM in nonconvex nonsmooth optimization," Journal of Scientific Computing, vol. 78, no. 1, pp. 29–63, 2019.
- [11] L. Jin and X. Wang, "A stochastic primal-dual method for a class of nonconvex constrained optimization," Computational Optimization and Applications, vol. 83, no. 1, pp. 143–180, 2022.
- [12] T. Lin, Z. Zheng, and M. I. Jordan, "A single-loop gradient descent and perturbed ascent algorithm for nonconvex functional constrained optimization," in International Conference on Machine Learning (ICML), PMLR, 2020, pp. 6080–6090.
- [13] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," Advances in Neural Information Processing Systems, vol. 26, 2013.
- [14] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, "SARAH: A novel method for machine learning problems using stochastic recursive gradient," in International Conference on Machine Learning, PMLR, 2017, pp. 2613–2621.
- [15] S. Zheng and J. T. Kwok, "Fast-and-light stochastic ADMM," in IJCAI, 2016, pp. 2407–2613.
- [16] Y. Zeng, Z. Wang, J. Bai, and X. Shen, "An accelerated stochastic ADMM for nonconvex and nonsmooth finite-sum optimization," Automatica, vol. 163, p. 111554, 2024.
- [17] A. Alacaoglu and S. J. Wright, "Complexity of single loop algorithms for nonlinear programming with stochastic objective and constraints," in International Conference on Artificial Intelligence and Statistics, PMLR, 2024, pp. 4627–4635.
- [18] F. Huang, S. Chen, and H. Huang, "Faster stochastic alternating direction method of multipliers for nonconvex optimization," in International Conference on Machine Learning, PMLR, 2019, pp. 2839–2848.
- [19] Z. Lu, S. Mei, and Y. Xiao, "Variance-reduced first-order methods for deterministically constrained stochastic nonconvex optimization with strong convergence guarantees," SIAM Journal on Optimization, vol. 36, no. 1, pp. 1–31, 2026.
- [20] R. Huang, J. Zhang, and A. Alacaoglu, "Stochastic smoothed primal-dual algorithms for nonconvex optimization with linear inequality constraints," arXiv preprint arXiv:2504.07607, 2025.
- [21] A. Cutkosky and H. Mehta, "Momentum improves normalized SGD," in International Conference on Machine Learning, 2020, pp. 2260–2268.
- [22] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens, "Learning with the maximum correntropy criterion induced losses for regression," Journal of Machine Learning Research, vol. 16, no. 1, pp. 993–1034, 2015.