SNAP: Finding Approximate Second-Order Stationary Solutions Efficiently for Non-convex Linearly Constrained Problems
Pith reviewed 2026-05-25 00:02 UTC · model grok-4.3
The pith
SNAP finds (ε, √ε)-approximate second-order stationary points of non-convex linearly constrained problems in O(1/ε^{2.5}) iterations with polynomial per-step cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For generic instances, strict complementarity holds at all KKT solutions with probability one; this equivalence lets the SNAP algorithm, which successively applies gradient projection or negative-curvature projection, compute an (ε, √ε)-SOSP in O(1/ε^{2.5}) iterations with per-iteration cost polynomial in the number of constraints and the dimension.
What carries the argument
The Successive Negative-curvature grAdient Projection (SNAP) procedure that alternates conventional gradient projection steps with negative-curvature-based projection steps.
If this is right
- First-order methods now possess global sublinear rates for second-order stationarity on non-convex problems with linear constraints.
- Per-iteration work stays polynomial even when the number of linear constraints is large.
- The same iteration bound applies to a purely first-order variant SNAP+.
- Generic linear-constraint problems become tractable for second-order methods under the stated genericity condition.
Where Pith is reading between the lines
- Many practical linearly constrained non-convex problems may therefore admit efficient second-order solutions despite worst-case intractability results.
- The same strict-complementarity argument could be tested on problems with other simple constraint sets such as bound constraints or simplices.
- Negative-curvature steps may be combinable with other first-order primitives such as accelerated gradient methods while preserving the rate.
Load-bearing premise
The strict complementarity condition holds at every Karush-Kuhn-Tucker solution for a generic problem instance with probability one.
What would settle it
A concrete non-generic instance in which SNAP or SNAP+ requires more than O(1/ε^{2.5}) iterations to reach an (ε, √ε)-SOSP, or in which the two SOSP notions are inequivalent.
Figures
read the original abstract
This paper proposes low-complexity algorithms for finding approximate second-order stationary points (SOSPs) of problems with smooth non-convex objective and linear constraints. While finding (approximate) SOSPs is computationally intractable, we first show that generic instances of the problem can be solved efficiently. More specifically, for a generic problem instance, certain strict complementarity (SC) condition holds for all Karush-Kuhn-Tucker (KKT) solutions (with probability one). The SC condition is then used to establish an equivalence relationship between two different notions of SOSPs, one of which is computationally easy to verify. Based on this particular notion of SOSP, we design an algorithm named the Successive Negative-curvature grAdient Projection (SNAP), which successively performs either conventional gradient projection or some negative curvature based projection steps to find SOSPs. SNAP and its first-order extension SNAP$^+$, require $\mathcal{O}(1/\epsilon^{2.5})$ iterations to compute an $(\epsilon, \sqrt{\epsilon})$-SOSP, and their per-iteration computational complexities are polynomial in the number of constraints and problem dimension. To our knowledge, this is the first time that first-order algorithms with polynomial per-iteration complexity and global sublinear rate have been designed to find SOSPs of the important class of non-convex problems with linear constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SNAP and its first-order variant SNAP+ for computing approximate second-order stationary points of smooth non-convex objectives subject to linear constraints. Under a generic strict complementarity (SC) condition that holds with probability one at all KKT points, the authors establish an equivalence between the standard (ε,√ε)-SOSP notion and a computationally tractable variant based on negative curvature in the projected tangent space. They then design successive negative-curvature gradient projection steps that achieve O(1/ε^{2.5}) iterations with per-iteration cost polynomial in dimension and number of constraints.
Significance. If the genericity argument and equivalence hold, the work supplies the first first-order methods with polynomial per-iteration complexity and global sublinear rate for SOSPs under linear constraints, a setting where second-order methods are typically required or rates are worse.
major comments (2)
- [SC condition and equivalence lemma] The O(1/ε^{2.5}) iteration bound and the claim that it certifies a standard (ε,√ε)-SOSP rest entirely on the SC genericity argument and the subsequent equivalence lemma. The manuscript must exhibit the precise measure-zero set in the data space and confirm that the equivalence covers every KKT point without additional non-generic conditions (see the paragraph following the SC statement in the abstract and the corresponding lemma in the main text).
- [Algorithm description and complexity analysis] The per-iteration complexity is stated to be polynomial in d and m, yet the negative-curvature projection step requires solving a linear system or eigenvalue problem on the tangent space at each iteration; the manuscript should supply the exact arithmetic complexity (including the cost of the projection oracle) to substantiate the polynomial claim.
minor comments (2)
- [Abstract and main theorem] The abstract states the rate as O(1/ε^{2.5}); confirm that the same exponent appears in the theorem statement and that the hidden constants are independent of the constraint matrix rank.
- [Preliminaries] Notation for the two SOSP notions should be introduced once and used consistently; currently the distinction between the standard and tractable variants is introduced only informally.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and will incorporate the suggested clarifications into a revised manuscript.
read point-by-point responses
-
Referee: [SC condition and equivalence lemma] The O(1/ε^{2.5}) iteration bound and the claim that it certifies a standard (ε,√ε)-SOSP rest entirely on the SC genericity argument and the subsequent equivalence lemma. The manuscript must exhibit the precise measure-zero set in the data space and confirm that the equivalence covers every KKT point without additional non-generic conditions (see the paragraph following the SC statement in the abstract and the corresponding lemma in the main text).
Authors: We agree that a more explicit characterization strengthens the presentation. In the revision we will define the measure-zero set in the data space (objective gradient/Hessian entries together with the entries of the constraint matrix A) as the algebraic variety on which either (i) the gradients of the active constraints at a KKT point become linearly dependent or (ii) the projected Hessian onto the tangent space has a zero eigenvalue. We will prove that this set has Lebesgue measure zero and that, outside this set, the strict-complementarity condition holds at every KKT point. Consequently the equivalence lemma between the standard (ε,√ε)-SOSP and the computationally tractable negative-curvature notion holds for all KKT points with no further non-generic assumptions required. The abstract paragraph and the lemma statement will be updated accordingly. revision: yes
-
Referee: [Algorithm description and complexity analysis] The per-iteration complexity is stated to be polynomial in d and m, yet the negative-curvature projection step requires solving a linear system or eigenvalue problem on the tangent space at each iteration; the manuscript should supply the exact arithmetic complexity (including the cost of the projection oracle) to substantiate the polynomial claim.
Authors: We accept the request for explicit arithmetic bounds. The projection oracle onto the tangent space is realized by a QR factorization of the active-constraint matrix (cost O(d m²) arithmetic operations). The negative-curvature direction is obtained by computing the smallest eigenpair of the projected Hessian, a symmetric matrix of size at most (d−m)×(d−m); this can be performed with the QR algorithm in O((d−m)³) operations or faster with Lanczos when only the smallest eigenvalue is needed. Both quantities are polynomial in d and m. A new subsection will be added that states these exact operation counts and confirms that the overall per-iteration cost remains polynomial. revision: yes
Circularity Check
No circularity; derivation self-contained via standard genericity argument
full rationale
The paper invokes a standard measure-theoretic genericity result (SC holds w.p. 1 for generic data) to equate two SOSP notions, then derives an O(1/ε^{2.5}) rate for the tractable variant under that assumption. No quoted step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology; the equivalence lemma and complexity bound are obtained by direct algorithmic analysis once the generic case is fixed. This matches the most common honest non-finding for papers that condition on generic instances.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Strict complementarity condition holds for all KKT solutions with probability one on generic instances
- standard math Standard KKT conditions and first-order optimality theory apply to the problem class
Reference graph
Works this paper leans on
-
[1]
Learning the parts of objects by non-negative matrix factorization,
D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999
work page 1999
-
[2]
Deep learning without poor local minima,
K. Kawaguchi, “Deep learning without poor local minima, ” in Proceedings of Neural Information Pro- cessing Systems (NIPS) , pp. 586–594, 2016
work page 2016
-
[3]
Theoret ical insights into the optimization landscape of over-parameterized shallow neural networks,
M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoret ical insights into the optimization landscape of over-parameterized shallow neural networks,” IEEE Transactions on Information Theory , vol. 65, pp. 742–769, Feb. 2019
work page 2019
-
[4]
Escaping from saddle points—online stochastic gradient for tensor decomposition,
R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Proceedings of Annual Conference on Learning Theory (COLT) , pp. 797–842, 2015
work page 2015
-
[5]
A Geometric Analysis of Phase Retrieval
J. Sun, Q. Qu, and J. Wright, “A geometric analysis of phas e retrieval,” arXiv:1602.06664 [cs.IT] , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[6]
No spurious local minima in no nconvex low rank problems: A unified geometric analysis,
R. Ge, C. Jin, and Y. Zheng, “No spurious local minima in no nconvex low rank problems: A unified geometric analysis,” in Proceedings of International Conference on Machine Learning ( ICML), pp. 1233– 1242, 2017
work page 2017
-
[7]
Complexity analysis of seco nd-order line-search algorithms for smooth nonconvex optimization,
C. W. Royer and S. J. Wright, “Complexity analysis of seco nd-order line-search algorithms for smooth nonconvex optimization,” SIAM Journal on Optimization , vol. 28, no. 2, pp. 1448–1477, 2018
work page 2018
-
[8]
Accele rated methods for nonconvex optimization,
Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, “Accele rated methods for nonconvex optimization,” SIAM Journal on Optimization , vol. 28, no. 2, pp. 1751–1772, 2018
work page 2018
-
[9]
Finding approximate local minima faster than gradient descent,
N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma, “Finding approximate local minima faster than gradient descent,” in Proceedings of Annual ACM Symposium on the Theory of Computin g (STOC), pp. 1195–1199, 2017
work page 2017
-
[10]
A newton-ba sed method for nonconvex optimization with fast evasion of saddle points,
S. Paternain, A. Mokhtari, and A. Ribeiro, “A newton-ba sed method for nonconvex optimization with fast evasion of saddle points,” SIAM Journal on Optimization , vol. 29, no. 1, pp. 343–368, 2019
work page 2019
-
[11]
A newton-CG al gorithm with complexity guarantees for smooth unconstrained optimization,
C. W. Royer, M. O’Neill, and S. J. Wright, “A newton-CG al gorithm with complexity guarantees for smooth unconstrained optimization,” Mathematical Programming, Jan. 2019
work page 2019
-
[12]
Gra dient descent only converges to minimizers,
J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gra dient descent only converges to minimizers,” in Proceedings of Annual Conference on Learning Theory (COLT) , pp. 1246–1257, 2016
work page 2016
-
[13]
How to escape saddle points efficiently,
C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jorda n, “How to escape saddle points efficiently,” in Proceedings of International Conference on Machine Learning ( ICML), pp. 1724–1732, 2017
work page 2017
-
[14]
First-order stochastic alg orithms for escaping from saddle points in almost linear time,
Y. Xu, J. Rong, and T. Yang, “First-order stochastic alg orithms for escaping from saddle points in almost linear time,” in Proceedings of Neural Information Processing Systems (NeurIPS ), pp. 5535–5545, 2018
work page 2018
-
[15]
Neon2: Finding local minima via first-order oracles,
Z. Allen-Zhu and Y. Li, “Neon2: Finding local minima via first-order oracles,” in Proceedings of Neural Information Processing Systems (NeurIPS) , pp. 3720–3730, 2018
work page 2018
-
[16]
S. Lu, M. Hong, and Z. Wang, “PA-GD: On the convergence of perturbed alternating gradient descent to second-order stationary points for structured nonconvex o ptimization,” in Proceedings of International Conference on Machine Learning (ICML) , pp. 4134–4143, 2019
work page 2019
-
[17]
Using negati ve curvature in solving nonlinear programs,
D. Goldfarb, C. Mu, J. Wright, and C. Zhou, “Using negati ve curvature in solving nonlinear programs,” Computational Optimization and Applications , vol. 68, pp. 479–502, Dec. 2017. 17
work page 2017
-
[18]
M. Hong, J. D. Lee, and M. Razaviyayn, “Gradient primal- dual algorithm converges to second-order stationary solutions for nonconvex distributed optimizat ion,” in Proceedings of International Conference on Machine Learning (ICML) , 2018
work page 2018
-
[19]
Learning understandabl e neural networks with nonnegative weight constraints,
J. Chorowski and J. M. Zurada, “Learning understandabl e neural networks with nonnegative weight constraints,” IEEE Transactions on Neural Networks and Learning Systems , vol. 26, pp. 62–69, Jan. 2015
work page 2015
-
[20]
Tensor decomposition for signal processing and machine learning,
N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E . Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning, ” IEEE Transactions on Signal Processing , vol. 65, no. 13, pp. 3551–3582, 2017
work page 2017
-
[21]
On nonconvex quadratic pr ogramming with box constraints,
S. Burer and A. N. Letchford, “On nonconvex quadratic pr ogramming with box constraints,” SIAM Journal on Optimization , vol. 20, no. 2, pp. 1073–1089, 2009
work page 2009
-
[22]
Second-order negative-curvature methods for box-constrained and general constrained optim ization,
R. Andreani, E. G. Birgin, J. M. Martínez, and M. L. Schuv erdt, “Second-order negative-curvature methods for box-constrained and general constrained optim ization,” Computational Optimization and Applications, vol. 45, no. 2, pp. 209–236, 2010
work page 2010
-
[23]
A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust region methods. SIAM, 2000
work page 2000
-
[24]
C. Cartis, N. I. Gould, and P. L. Toint, “Second-order op timality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear o ptimization,” Foundations of Computational Mathematics, vol. 18, no. 5, pp. 1073–1107, 2018
work page 2018
-
[25]
Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis
M. Raginsky, A. Rakhlin, and M. Telgarsky, “Non-convex learning via stochastic gradient langevin dynamics: A nonasymptotic analysis,” arXiv preprint arXiv:1702.03849 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[26]
G. Di Pillo, S. Lucidi, and L. Palagi, “Convergence to se cond-order stationary points of a primal- dual algorithm model for nonlinear programming,” Mathematics of Operations Research , vol. 30, no. 4, pp. 897–915, 2005
work page 2005
-
[27]
Convergence to second-order stationarity for constrained non-convex optimization,
M. Nouiehed, J. D. Lee, and M. Razaviyayn, “Convergence to second-order stationarity for constrained non-convex optimization,” arXiv preprint arXiv:1810.02024 , 2018
-
[28]
Escaping s addle points in constrained optimization,
A. Mokhtari, A. Ozdaglar, and A. Jadbabaie, “Escaping s addle points in constrained optimization,” in Proceedings of Neural Information Processing Systems (NeurIPS ), pp. 3629–3639, 2018
work page 2018
-
[29]
Information-based complexity of conv ex programming,
A. Nemirovski, “Information-based complexity of conv ex programming,” Lecture Notes, 1995
work page 1995
-
[30]
D. P. Bertsekas, Nonlinear Programming, 2nd ed . Belmont, MA: Athena Scientific, 1999
work page 1999
-
[31]
Some np-complete problems in quadratic and nonlinear programming,
K. G. Murty and S. N. Kabadi, “Some np-complete problems in quadratic and nonlinear programming,” Mathematical programming, vol. 39, no. 2, pp. 117–129, 1987
work page 1987
-
[32]
Global conver gence of a class of trust region algorithms for optimization with simple bounds,
A. R. Conn, N. I. M. Gould, and P. L. Toint, “Global conver gence of a class of trust region algorithms for optimization with simple bounds,” SIAM Journal on Numerical Analysis , vol. 25, no. 2, pp. 433–460, 1989
work page 1989
-
[33]
Newton’s method for large boun d-constrained optimization problems,
C.-J. Lin and J. J. Moré, “Newton’s method for large boun d-constrained optimization problems,” SIAM Journal on Optimization , vol. 9, no. 4, pp. 1100–1127, 1999
work page 1999
-
[34]
M. Lescrenier, “Convergence of trust region algorithm s for optimization with bounds when strict com- plementarity does not hold,” SIAM Journal on Numerical Analysis , vol. 28, no. 2, pp. 476–495, 1991
work page 1991
-
[35]
J. Zhang and Z.-Q. Luo, “A proximal alternating directi on method of multiplier for linearly constrained nonconvex minimization,” arXiv preprint arXiv:1812.10229v2 , 2019. 18
-
[36]
D. P. Bertsekas, Constrained optimization and Lagrange multiplier methods . Academic Press, 2014
work page 2014
-
[37]
Two-metric projection methods for constrained optimization,
E. M. Gafni and D. P. Bertsekas, “Two-metric projection methods for constrained optimization,” SIAM Journal on Control and Optimization , vol. 22, no. 6, pp. 936–964, 1984
work page 1984
-
[38]
On the accurat e identification of active constraints,
F. Facchinei, A. Fischer, and C. Kanzow, “On the accurat e identification of active constraints,” SIAM Journal on Optimization , vol. 9, no. 1, pp. 14–32, 1998
work page 1998
-
[39]
Convergence to second orde r stationary points in inequality constrained optimization,
F. Facchinei and S. Lucidi, “Convergence to second orde r stationary points in inequality constrained optimization,” Mathematics of Operations Research , vol. 23, no. 3, pp. 746–766, 1998
work page 1998
-
[40]
M. Nouiehed and M. Razaviyayn, “A trust region method fo r finding second-order stationarity in linearly constrained non-convex optimization,” arXiv preprint arXiv:1904.06784 , 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[41]
A database for handwritten text recognitio n research,
J. J. Hull, “A database for handwritten text recognitio n research,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 16, no. 5, pp. 550–554, 1994
work page 1994
-
[42]
S. Wang, T. Chang, Y. Cui, and J. Pang, “Clustering by ort hogonal non-negative matrix factorization: A sequential non-convex penalty approach,” in Proceedings of IEEE International Conference on Acoustics Speech and Signal Process. (ICASSP) , pp. 5576–5580, May 2019. 19 A Proofs Related to Stationary Points A.1 Proof of Proposition 2 Proof. When f (x) = ...
work page 2019
-
[43]
(A′(x(r))d(r))i≤ 0: any α> 0 can satisfy (63)
-
[44]
(A′(x(r))d(r))i > 0: we need α≤ (b′− A′(x(r))x(r))i/ (A′(x(r))d(r))i. That is, going along the current direction far enough will eventually reach the boundary of t he feasible set. Then it follows if there exists a finite step-size α so that (A′(x(r))(x(r) +α d(r)))i = b′ i, (64) we can easily compute α (r) max in the closed-form by (26). 25 C Proofs of SN...
-
[45]
since in line 8 of Algorithm 1 we know from the algorithm that−qπ (x(r)) is chosen when −qπ (x(r)) T v(x(r))L1ǫ′ H (δ) L2 + 63L1ǫ′3 H (δ) 128L2 2 ≤∥ qπ (x(r))∥2, q π (x(r)) T v(x(r))≤ 0. This completes the proof. Lemma 6. If d(r) is chosen by v(x(r)), x(r+1) is computed by the NCD procedure in Algorithm 1 and α (r) max is not selected, then the line searc...
-
[46]
06ǫ′3 H(δ) L2 2 , ǫ2 G 18L1 } 1 min{d,m}, (84) we obtain r≤ f (x(1))− f ⋆ ∆′ . (85) Since the probability that eigen-pair fails to extract the n egative curvature isδ, applying the union bound, we only need to set δ′ =δ(f (x(1)− f ⋆)/ ∆′ so that we can have the claim that SNAP will output approximat e SOSP1s with probability 1− δ′. Note that γǫ′ H(δ)>ǫ H....
-
[47]
Then we must have f (x(r))− f (x(r− min{d,m }))≤− min { ǫ2 G/ (18L1), 0
or the PGD step, otherwise the algorithm will stop and output an (ǫG,ǫ H )-SOSP1 solution. Then we must have f (x(r))− f (x(r− min{d,m }))≤− min { ǫ2 G/ (18L1), 0. 06ǫ′3 H(δ)/L 2 2 } ,∀r> min{d,m}. (90) Summarizing the argument so far in the second case, we have th at, after every consecutive min{d,m} iterations of the algorithm, either the algorithms sto...
-
[48]
∥gπ (x(r))∥≥ ǫG: descent per-iteration is ǫ2 G 18L1 by Lemma 4
-
[49]
∥gπ (x(r))∥≤ ǫG 32 • flagα =∅ and r− rlast <r th: descent per-iteration: 0. 06ǫ′3 H(δ)/ (rthL2
-
[50]
• flagα = ♦ or r− rlast≥ rth: – flag =∅: This case means that x(r) is an (ǫG,ǫ H)-SOSP1
by (89). • flagα = ♦ or r− rlast≥ rth: – flag =∅: This case means that x(r) is an (ǫG,ǫ H)-SOSP1. We output x(r). – flag = ♦ (there is a negative curvature): ∗ flagα =∅ (boundary not touched): descent per-iteration is at least 0. 06 ǫ′3 H(δ) L2 2 by (88). ∗ flagα = ♦ (boundary touched): descent per-iteration min{ ǫ2 G 18L1 , 0. 06 ǫ′3 H (δ) L2 2 }/ min{d,m} by...
-
[51]
From (109), we know that SP-GD is always decreasing the appro ximate objective function ˆf
Case T ′≤ T ∗ : Applying Lemma 9, we know that f (x + u(T ′))− f (x)−∇ f (x) T u(T ′)≤ f (x + u(1))− f (x)−∇ f (x) T u(1)− 2F (a) ≤ L1 2∥u(1)∥2− 2F (b) ≤− 2F (143) 39 where (a) is true because of the L1-gradient Lipschitz continuity; (b) is true because u(1) = 0. From (109), we know that SP-GD is always decreasing the appro ximate objective function ˆf . ...
-
[52]
Define T ′′ = inf r≥ 1 { r|f (w(r))− f (w(1))≤− 2F }
Case T ′ > T∗: Applying Lemma 9, we know that ∥u(r)− u(1)∥≤ 3S for r < T∗ . Define T ′′ = inf r≥ 1 { r|f (w(r))− f (w(1))≤− 2F } . Then, after applying Lemma 10, we know T ′′< T∗ . Using the same argument as the above case, for T≥ ˆcT =T ∗ >T ′′, we also have f (x + w(T ))− f (x)−∇ f (x) T w(T )≤ f (w(T ∗ ))− f (x)−∇ f (x) T w(T ∗ ) ≤f (w(T ′′))− f (x)−∇ f...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.