SNAP: Finding Approximate Second-Order Stationary Solutions Efficiently for Non-convex Linearly Constrained Problems

Bo Yang; Kejun Huang; Meisam Razaviyayn; Mingyi Hong; Songtao Lu

arxiv: 1907.04450 · v1 · pith:Q4E42MHInew · submitted 2019-07-09 · 🧮 math.OC · cs.CC· stat.ML

SNAP: Finding Approximate Second-Order Stationary Solutions Efficiently for Non-convex Linearly Constrained Problems

Songtao Lu , Meisam Razaviyayn , Bo Yang , Kejun Huang , Mingyi Hong This is my paper

Pith reviewed 2026-05-25 00:02 UTC · model grok-4.3

classification 🧮 math.OC cs.CCstat.ML

keywords non-convex optimizationsecond-order stationary pointslinear constraintsgradient projectionnegative curvatureconstrained optimizationfirst-order methods

0 comments

The pith

SNAP finds (ε, √ε)-approximate second-order stationary points of non-convex linearly constrained problems in O(1/ε^{2.5}) iterations with polynomial per-step cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that for generic instances of smooth non-convex problems with linear constraints, a strict complementarity condition holds at every KKT point with probability one. This condition makes two different definitions of second-order stationarity equivalent, and one of them is computationally tractable to check. The authors use the tractable notion to build the SNAP algorithm, which alternates standard gradient-projection steps with negative-curvature projection steps. Both SNAP and its first-order variant SNAP+ reach an (ε, √ε)-SOSP after O(1/ε^{2.5}) iterations while keeping each iteration polynomial in dimension and number of constraints. This supplies the first global sublinear rate for first-order methods on this problem class.

Core claim

For generic instances, strict complementarity holds at all KKT solutions with probability one; this equivalence lets the SNAP algorithm, which successively applies gradient projection or negative-curvature projection, compute an (ε, √ε)-SOSP in O(1/ε^{2.5}) iterations with per-iteration cost polynomial in the number of constraints and the dimension.

What carries the argument

The Successive Negative-curvature grAdient Projection (SNAP) procedure that alternates conventional gradient projection steps with negative-curvature-based projection steps.

If this is right

First-order methods now possess global sublinear rates for second-order stationarity on non-convex problems with linear constraints.
Per-iteration work stays polynomial even when the number of linear constraints is large.
The same iteration bound applies to a purely first-order variant SNAP+.
Generic linear-constraint problems become tractable for second-order methods under the stated genericity condition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many practical linearly constrained non-convex problems may therefore admit efficient second-order solutions despite worst-case intractability results.
The same strict-complementarity argument could be tested on problems with other simple constraint sets such as bound constraints or simplices.
Negative-curvature steps may be combinable with other first-order primitives such as accelerated gradient methods while preserving the rate.

Load-bearing premise

The strict complementarity condition holds at every Karush-Kuhn-Tucker solution for a generic problem instance with probability one.

What would settle it

A concrete non-generic instance in which SNAP or SNAP+ requires more than O(1/ε^{2.5}) iterations to reach an (ε, √ε)-SOSP, or in which the two SOSP notions are inequivalent.

Figures

Figures reproduced from arXiv: 1907.04450 by Bo Yang, Kejun Huang, Meisam Razaviyayn, Mingyi Hong, Songtao Lu.

**Figure 1.** Figure 1: The convergence behaviors of SNAP, SNAP+, PGD, PGD-LS for NMF, where c = 1. 7.1.1 Synthetic Dataset We compare the proposed SNAP, SNAP+ with PGD and PGD-LS on the synthetic dataset for MNF problem and show the advantages of exploiting negative curvatures. The data matrices are randomly generated, where m = 20, n = 50, k = 10, M = WHT , and [W; H] ∈ R (n+m)×k follows the uniform distribution in the interval… view at source ↗

**Figure 2.** Figure 2: The convergence behaviors of SNAP, SNAP+, PGD, PGD-LS for NMF, where c = 1 × 10−5 . SNAP SNAP+ PGD PGD-LS (a) Loss value versus iteration SNAP SNAP+ PGD PGD-LS (b) Loss value versus computational time [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: The convergence behaviors of SNAP, SNAP+, PGD, PGD-LS for NMF, where c = 1 × 10−10 . CN (0, 1). Clearly, the origin point is a strict saddle point. We use three different constants c to initialize sequence X(r) and the results are shown in [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: The convergence behaviors of SNAP+, PGD, PGD-LS for NMF, where c = 1 × 10−10 . occasionally rather than each step, so the computational time is not as high as PGD-LS. In [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: The convergence behaviors of SNAP+, PGD, PGD-LS for NNN, where c = 1. constraint as the following, min X∈Rn×k kM − HHT k s.t. H ≥ 0, HT 1 = 1. In the numerical experiments, the data is generated similar as the NMF case, where n = 100, k = 5 and each column of H is normalized. We set απ = 1 × 10−2 , T = 100, rth = 100, R = 1 × 10−4 and F = 100. From [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: The convergence behaviors of SNAP+, PGD, PGD-LS for matrix factorization under simplex constraints, where c = 1 × 10−10 . 15 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: The convergence behaviors of SNAP+, PGD, PGD-LS for penalized NMF, where c = 1 × 10−10 . Iteration (r) Loss value SNAP+ PGD PGD-LS (a) Loss value versus iteration Computational time (s) Loss value SNAP+ PGD PGD-LS (b) Loss value versus computational time [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: The convergence behaviors of SNAP+, PGD, PGD-LS for penalized NMF, where c = 1. 7.4 Penalized NMF We also consider a penalized version of NMF, i.e., min W∈Rn×k,H∈Rm×k kWHT − Mk 2 + ρ 2 Xm i [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

read the original abstract

This paper proposes low-complexity algorithms for finding approximate second-order stationary points (SOSPs) of problems with smooth non-convex objective and linear constraints. While finding (approximate) SOSPs is computationally intractable, we first show that generic instances of the problem can be solved efficiently. More specifically, for a generic problem instance, certain strict complementarity (SC) condition holds for all Karush-Kuhn-Tucker (KKT) solutions (with probability one). The SC condition is then used to establish an equivalence relationship between two different notions of SOSPs, one of which is computationally easy to verify. Based on this particular notion of SOSP, we design an algorithm named the Successive Negative-curvature grAdient Projection (SNAP), which successively performs either conventional gradient projection or some negative curvature based projection steps to find SOSPs. SNAP and its first-order extension SNAP$^+$, require $\mathcal{O}(1/\epsilon^{2.5})$ iterations to compute an $(\epsilon, \sqrt{\epsilon})$-SOSP, and their per-iteration computational complexities are polynomial in the number of constraints and problem dimension. To our knowledge, this is the first time that first-order algorithms with polynomial per-iteration complexity and global sublinear rate have been designed to find SOSPs of the important class of non-convex problems with linear constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SNAP gives the first first-order method with O(1/ε^{2.5}) rate and poly(d,m) per-step cost for approximate SOSPs in non-convex linear-constrained problems, via a strict-complementarity reduction that equates two SOSP notions on generic instances.

read the letter

The core contribution is the reduction that uses strict complementarity (SC) holding with probability one on generic data to equate the standard (ε, √ε)-SOSP with a version whose negative curvature can be found after projection onto the tangent space. From there they build SNAP, which mixes ordinary gradient-projection steps with negative-curvature projection steps, and the first-order-only SNAP+ variant. Both reach the target in O(1/ε^{2.5}) iterations with per-iteration work polynomial in dimension and number of constraints. That rate and the first-order restriction are new for this problem class; earlier work either used second-order oracles or lacked global rates under linear constraints.

Referee Report

2 major / 2 minor

Summary. The paper proposes SNAP and its first-order variant SNAP+ for computing approximate second-order stationary points of smooth non-convex objectives subject to linear constraints. Under a generic strict complementarity (SC) condition that holds with probability one at all KKT points, the authors establish an equivalence between the standard (ε,√ε)-SOSP notion and a computationally tractable variant based on negative curvature in the projected tangent space. They then design successive negative-curvature gradient projection steps that achieve O(1/ε^{2.5}) iterations with per-iteration cost polynomial in dimension and number of constraints.

Significance. If the genericity argument and equivalence hold, the work supplies the first first-order methods with polynomial per-iteration complexity and global sublinear rate for SOSPs under linear constraints, a setting where second-order methods are typically required or rates are worse.

major comments (2)

[SC condition and equivalence lemma] The O(1/ε^{2.5}) iteration bound and the claim that it certifies a standard (ε,√ε)-SOSP rest entirely on the SC genericity argument and the subsequent equivalence lemma. The manuscript must exhibit the precise measure-zero set in the data space and confirm that the equivalence covers every KKT point without additional non-generic conditions (see the paragraph following the SC statement in the abstract and the corresponding lemma in the main text).
[Algorithm description and complexity analysis] The per-iteration complexity is stated to be polynomial in d and m, yet the negative-curvature projection step requires solving a linear system or eigenvalue problem on the tangent space at each iteration; the manuscript should supply the exact arithmetic complexity (including the cost of the projection oracle) to substantiate the polynomial claim.

minor comments (2)

[Abstract and main theorem] The abstract states the rate as O(1/ε^{2.5}); confirm that the same exponent appears in the theorem statement and that the hidden constants are independent of the constraint matrix rank.
[Preliminaries] Notation for the two SOSP notions should be introduced once and used consistently; currently the distinction between the standard and tractable variants is introduced only informally.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will incorporate the suggested clarifications into a revised manuscript.

read point-by-point responses

Referee: [SC condition and equivalence lemma] The O(1/ε^{2.5}) iteration bound and the claim that it certifies a standard (ε,√ε)-SOSP rest entirely on the SC genericity argument and the subsequent equivalence lemma. The manuscript must exhibit the precise measure-zero set in the data space and confirm that the equivalence covers every KKT point without additional non-generic conditions (see the paragraph following the SC statement in the abstract and the corresponding lemma in the main text).

Authors: We agree that a more explicit characterization strengthens the presentation. In the revision we will define the measure-zero set in the data space (objective gradient/Hessian entries together with the entries of the constraint matrix A) as the algebraic variety on which either (i) the gradients of the active constraints at a KKT point become linearly dependent or (ii) the projected Hessian onto the tangent space has a zero eigenvalue. We will prove that this set has Lebesgue measure zero and that, outside this set, the strict-complementarity condition holds at every KKT point. Consequently the equivalence lemma between the standard (ε,√ε)-SOSP and the computationally tractable negative-curvature notion holds for all KKT points with no further non-generic assumptions required. The abstract paragraph and the lemma statement will be updated accordingly. revision: yes
Referee: [Algorithm description and complexity analysis] The per-iteration complexity is stated to be polynomial in d and m, yet the negative-curvature projection step requires solving a linear system or eigenvalue problem on the tangent space at each iteration; the manuscript should supply the exact arithmetic complexity (including the cost of the projection oracle) to substantiate the polynomial claim.

Authors: We accept the request for explicit arithmetic bounds. The projection oracle onto the tangent space is realized by a QR factorization of the active-constraint matrix (cost O(d m²) arithmetic operations). The negative-curvature direction is obtained by computing the smallest eigenpair of the projected Hessian, a symmetric matrix of size at most (d−m)×(d−m); this can be performed with the QR algorithm in O((d−m)³) operations or faster with Lanczos when only the smallest eigenvalue is needed. Both quantities are polynomial in d and m. A new subsection will be added that states these exact operation counts and confirms that the overall per-iteration cost remains polynomial. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation self-contained via standard genericity argument

full rationale

The paper invokes a standard measure-theoretic genericity result (SC holds w.p. 1 for generic data) to equate two SOSP notions, then derives an O(1/ε^{2.5}) rate for the tractable variant under that assumption. No quoted step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology; the equivalence lemma and complexity bound are obtained by direct algorithmic analysis once the generic case is fixed. This matches the most common honest non-finding for papers that condition on generic instances.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that strict complementarity holds with probability one for generic instances and on standard KKT theory; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Strict complementarity condition holds for all KKT solutions with probability one on generic instances
Invoked to establish equivalence between two notions of SOSP and enable the algorithm design.
standard math Standard KKT conditions and first-order optimality theory apply to the problem class
Background assumption from optimization theory used throughout.

pith-pipeline@v0.9.0 · 5795 in / 1419 out tokens · 26439 ms · 2026-05-25T00:02:41.695342+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 3 internal anchors

[1]

Learning the parts of objects by non-negative matrix factorization,

D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999

work page 1999
[2]

Deep learning without poor local minima,

K. Kawaguchi, “Deep learning without poor local minima, ” in Proceedings of Neural Information Pro- cessing Systems (NIPS) , pp. 586–594, 2016

work page 2016
[3]

Theoret ical insights into the optimization landscape of over-parameterized shallow neural networks,

M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoret ical insights into the optimization landscape of over-parameterized shallow neural networks,” IEEE Transactions on Information Theory , vol. 65, pp. 742–769, Feb. 2019

work page 2019
[4]

Escaping from saddle points—online stochastic gradient for tensor decomposition,

R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Proceedings of Annual Conference on Learning Theory (COLT) , pp. 797–842, 2015

work page 2015
[5]

A Geometric Analysis of Phase Retrieval

J. Sun, Q. Qu, and J. Wright, “A geometric analysis of phas e retrieval,” arXiv:1602.06664 [cs.IT] , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

No spurious local minima in no nconvex low rank problems: A uniﬁed geometric analysis,

R. Ge, C. Jin, and Y. Zheng, “No spurious local minima in no nconvex low rank problems: A uniﬁed geometric analysis,” in Proceedings of International Conference on Machine Learning ( ICML), pp. 1233– 1242, 2017

work page 2017
[7]

Complexity analysis of seco nd-order line-search algorithms for smooth nonconvex optimization,

C. W. Royer and S. J. Wright, “Complexity analysis of seco nd-order line-search algorithms for smooth nonconvex optimization,” SIAM Journal on Optimization , vol. 28, no. 2, pp. 1448–1477, 2018

work page 2018
[8]

Accele rated methods for nonconvex optimization,

Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, “Accele rated methods for nonconvex optimization,” SIAM Journal on Optimization , vol. 28, no. 2, pp. 1751–1772, 2018

work page 2018
[9]

Finding approximate local minima faster than gradient descent,

N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma, “Finding approximate local minima faster than gradient descent,” in Proceedings of Annual ACM Symposium on the Theory of Computin g (STOC), pp. 1195–1199, 2017

work page 2017
[10]

A newton-ba sed method for nonconvex optimization with fast evasion of saddle points,

S. Paternain, A. Mokhtari, and A. Ribeiro, “A newton-ba sed method for nonconvex optimization with fast evasion of saddle points,” SIAM Journal on Optimization , vol. 29, no. 1, pp. 343–368, 2019

work page 2019
[11]

A newton-CG al gorithm with complexity guarantees for smooth unconstrained optimization,

C. W. Royer, M. O’Neill, and S. J. Wright, “A newton-CG al gorithm with complexity guarantees for smooth unconstrained optimization,” Mathematical Programming, Jan. 2019

work page 2019
[12]

Gra dient descent only converges to minimizers,

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gra dient descent only converges to minimizers,” in Proceedings of Annual Conference on Learning Theory (COLT) , pp. 1246–1257, 2016

work page 2016
[13]

How to escape saddle points eﬃciently,

C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jorda n, “How to escape saddle points eﬃciently,” in Proceedings of International Conference on Machine Learning ( ICML), pp. 1724–1732, 2017

work page 2017
[14]

First-order stochastic alg orithms for escaping from saddle points in almost linear time,

Y. Xu, J. Rong, and T. Yang, “First-order stochastic alg orithms for escaping from saddle points in almost linear time,” in Proceedings of Neural Information Processing Systems (NeurIPS ), pp. 5535–5545, 2018

work page 2018
[15]

Neon2: Finding local minima via ﬁrst-order oracles,

Z. Allen-Zhu and Y. Li, “Neon2: Finding local minima via ﬁrst-order oracles,” in Proceedings of Neural Information Processing Systems (NeurIPS) , pp. 3720–3730, 2018

work page 2018
[16]

PA-GD: On the convergence of perturbed alternating gradient descent to second-order stationary points for structured nonconvex o ptimization,

S. Lu, M. Hong, and Z. Wang, “PA-GD: On the convergence of perturbed alternating gradient descent to second-order stationary points for structured nonconvex o ptimization,” in Proceedings of International Conference on Machine Learning (ICML) , pp. 4134–4143, 2019

work page 2019
[17]

Using negati ve curvature in solving nonlinear programs,

D. Goldfarb, C. Mu, J. Wright, and C. Zhou, “Using negati ve curvature in solving nonlinear programs,” Computational Optimization and Applications , vol. 68, pp. 479–502, Dec. 2017. 17

work page 2017
[18]

Gradient primal- dual algorithm converges to second-order stationary solutions for nonconvex distributed optimizat ion,

M. Hong, J. D. Lee, and M. Razaviyayn, “Gradient primal- dual algorithm converges to second-order stationary solutions for nonconvex distributed optimizat ion,” in Proceedings of International Conference on Machine Learning (ICML) , 2018

work page 2018
[19]

Learning understandabl e neural networks with nonnegative weight constraints,

J. Chorowski and J. M. Zurada, “Learning understandabl e neural networks with nonnegative weight constraints,” IEEE Transactions on Neural Networks and Learning Systems , vol. 26, pp. 62–69, Jan. 2015

work page 2015
[20]

Tensor decomposition for signal processing and machine learning,

N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E . Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning, ” IEEE Transactions on Signal Processing , vol. 65, no. 13, pp. 3551–3582, 2017

work page 2017
[21]

On nonconvex quadratic pr ogramming with box constraints,

S. Burer and A. N. Letchford, “On nonconvex quadratic pr ogramming with box constraints,” SIAM Journal on Optimization , vol. 20, no. 2, pp. 1073–1089, 2009

work page 2009
[22]

Second-order negative-curvature methods for box-constrained and general constrained optim ization,

R. Andreani, E. G. Birgin, J. M. Martínez, and M. L. Schuv erdt, “Second-order negative-curvature methods for box-constrained and general constrained optim ization,” Computational Optimization and Applications, vol. 45, no. 2, pp. 209–236, 2010

work page 2010
[23]

A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust region methods. SIAM, 2000

work page 2000
[24]

Second-order op timality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear o ptimization,

C. Cartis, N. I. Gould, and P. L. Toint, “Second-order op timality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear o ptimization,” Foundations of Computational Mathematics, vol. 18, no. 5, pp. 1073–1107, 2018

work page 2018
[25]

Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

M. Raginsky, A. Rakhlin, and M. Telgarsky, “Non-convex learning via stochastic gradient langevin dynamics: A nonasymptotic analysis,” arXiv preprint arXiv:1702.03849 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Convergence to se cond-order stationary points of a primal- dual algorithm model for nonlinear programming,

G. Di Pillo, S. Lucidi, and L. Palagi, “Convergence to se cond-order stationary points of a primal- dual algorithm model for nonlinear programming,” Mathematics of Operations Research , vol. 30, no. 4, pp. 897–915, 2005

work page 2005
[27]

Convergence to second-order stationarity for constrained non-convex optimization,

M. Nouiehed, J. D. Lee, and M. Razaviyayn, “Convergence to second-order stationarity for constrained non-convex optimization,” arXiv preprint arXiv:1810.02024 , 2018

work page arXiv 2018
[28]

Escaping s addle points in constrained optimization,

A. Mokhtari, A. Ozdaglar, and A. Jadbabaie, “Escaping s addle points in constrained optimization,” in Proceedings of Neural Information Processing Systems (NeurIPS ), pp. 3629–3639, 2018

work page 2018
[29]

Information-based complexity of conv ex programming,

A. Nemirovski, “Information-based complexity of conv ex programming,” Lecture Notes, 1995

work page 1995
[30]

D. P. Bertsekas, Nonlinear Programming, 2nd ed . Belmont, MA: Athena Scientiﬁc, 1999

work page 1999
[31]

Some np-complete problems in quadratic and nonlinear programming,

K. G. Murty and S. N. Kabadi, “Some np-complete problems in quadratic and nonlinear programming,” Mathematical programming, vol. 39, no. 2, pp. 117–129, 1987

work page 1987
[32]

Global conver gence of a class of trust region algorithms for optimization with simple bounds,

A. R. Conn, N. I. M. Gould, and P. L. Toint, “Global conver gence of a class of trust region algorithms for optimization with simple bounds,” SIAM Journal on Numerical Analysis , vol. 25, no. 2, pp. 433–460, 1989

work page 1989
[33]

Newton’s method for large boun d-constrained optimization problems,

C.-J. Lin and J. J. Moré, “Newton’s method for large boun d-constrained optimization problems,” SIAM Journal on Optimization , vol. 9, no. 4, pp. 1100–1127, 1999

work page 1999
[34]

Convergence of trust region algorithm s for optimization with bounds when strict com- plementarity does not hold,

M. Lescrenier, “Convergence of trust region algorithm s for optimization with bounds when strict com- plementarity does not hold,” SIAM Journal on Numerical Analysis , vol. 28, no. 2, pp. 476–495, 1991

work page 1991
[35]

A proximal alternating directi on method of multiplier for linearly constrained nonconvex minimization,

J. Zhang and Z.-Q. Luo, “A proximal alternating directi on method of multiplier for linearly constrained nonconvex minimization,” arXiv preprint arXiv:1812.10229v2 , 2019. 18

work page arXiv 2019
[36]

D. P. Bertsekas, Constrained optimization and Lagrange multiplier methods . Academic Press, 2014

work page 2014
[37]

Two-metric projection methods for constrained optimization,

E. M. Gafni and D. P. Bertsekas, “Two-metric projection methods for constrained optimization,” SIAM Journal on Control and Optimization , vol. 22, no. 6, pp. 936–964, 1984

work page 1984
[38]

On the accurat e identiﬁcation of active constraints,

F. Facchinei, A. Fischer, and C. Kanzow, “On the accurat e identiﬁcation of active constraints,” SIAM Journal on Optimization , vol. 9, no. 1, pp. 14–32, 1998

work page 1998
[39]

Convergence to second orde r stationary points in inequality constrained optimization,

F. Facchinei and S. Lucidi, “Convergence to second orde r stationary points in inequality constrained optimization,” Mathematics of Operations Research , vol. 23, no. 3, pp. 746–766, 1998

work page 1998
[40]

A Trust Region Method for Finding Second-Order Stationarity in Linearly Constrained Non-Convex Optimization

M. Nouiehed and M. Razaviyayn, “A trust region method fo r ﬁnding second-order stationarity in linearly constrained non-convex optimization,” arXiv preprint arXiv:1904.06784 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[41]

A database for handwritten text recognitio n research,

J. J. Hull, “A database for handwritten text recognitio n research,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 16, no. 5, pp. 550–554, 1994

work page 1994
[42]

Clustering by ort hogonal non-negative matrix factorization: A sequential non-convex penalty approach,

S. Wang, T. Chang, Y. Cui, and J. Pang, “Clustering by ort hogonal non-negative matrix factorization: A sequential non-convex penalty approach,” in Proceedings of IEEE International Conference on Acoustics Speech and Signal Process. (ICASSP) , pp. 5576–5580, May 2019. 19 A Proofs Related to Stationary Points A.1 Proof of Proposition 2 Proof. When f (x) = ...

work page 2019
[43]

(A′(x(r))d(r))i≤ 0: any α> 0 can satisfy (63)

work page
[44]

That is, going along the current direction far enough will eventually reach the boundary of t he feasible set

(A′(x(r))d(r))i > 0: we need α≤ (b′− A′(x(r))x(r))i/ (A′(x(r))d(r))i. That is, going along the current direction far enough will eventually reach the boundary of t he feasible set. Then it follows if there exists a ﬁnite step-size α so that (A′(x(r))(x(r) +α d(r)))i = b′ i, (64) we can easily compute α (r) max in the closed-form by (26). 25 C Proofs of SN...

work page
[45]

This completes the proof

since in line 8 of Algorithm 1 we know from the algorithm that−qπ (x(r)) is chosen when −qπ (x(r)) T v(x(r))L1ǫ′ H (δ) L2 + 63L1ǫ′3 H (δ) 128L2 2 ≤∥ qπ (x(r))∥2, q π (x(r)) T v(x(r))≤ 0. This completes the proof. Lemma 6. If d(r) is chosen by v(x(r)), x(r+1) is computed by the NCD procedure in Algorithm 1 and α (r) max is not selected, then the line searc...

work page
[46]

06ǫ′3 H(δ) L2 2 , ǫ2 G 18L1 } 1 min{d,m}, (84) we obtain r≤ f (x(1))− f ⋆ ∆′ . (85) Since the probability that eigen-pair fails to extract the n egative curvature isδ, applying the union bound, we only need to set δ′ =δ(f (x(1)− f ⋆)/ ∆′ so that we can have the claim that SNAP will output approximat e SOSP1s with probability 1− δ′. Note that γǫ′ H(δ)>ǫ H....

work page
[47]

Then we must have f (x(r))− f (x(r− min{d,m }))≤− min { ǫ2 G/ (18L1), 0

or the PGD step, otherwise the algorithm will stop and output an (ǫG,ǫ H )-SOSP1 solution. Then we must have f (x(r))− f (x(r− min{d,m }))≤− min { ǫ2 G/ (18L1), 0. 06ǫ′3 H(δ)/L 2 2 } ,∀r> min{d,m}. (90) Summarizing the argument so far in the second case, we have th at, after every consecutive min{d,m} iterations of the algorithm, either the algorithms sto...

work page
[48]

∥gπ (x(r))∥≥ ǫG: descent per-iteration is ǫ2 G 18L1 by Lemma 4

work page
[49]

06ǫ′3 H(δ)/ (rthL2

∥gπ (x(r))∥≤ ǫG 32 • ﬂagα =∅ and r− rlast <r th: descent per-iteration: 0. 06ǫ′3 H(δ)/ (rthL2

work page
[50]

• ﬂagα = ♦ or r− rlast≥ rth: – ﬂag =∅: This case means that x(r) is an (ǫG,ǫ H)-SOSP1

by (89). • ﬂagα = ♦ or r− rlast≥ rth: – ﬂag =∅: This case means that x(r) is an (ǫG,ǫ H)-SOSP1. We output x(r). – ﬂag = ♦ (there is a negative curvature): ∗ ﬂagα =∅ (boundary not touched): descent per-iteration is at least 0. 06 ǫ′3 H(δ) L2 2 by (88). ∗ ﬂagα = ♦ (boundary touched): descent per-iteration min{ ǫ2 G 18L1 , 0. 06 ǫ′3 H (δ) L2 2 }/ min{d,m} by...

work page
[51]

From (109), we know that SP-GD is always decreasing the appro ximate objective function ˆf

Case T ′≤ T ∗ : Applying Lemma 9, we know that f (x + u(T ′))− f (x)−∇ f (x) T u(T ′)≤ f (x + u(1))− f (x)−∇ f (x) T u(1)− 2F (a) ≤ L1 2∥u(1)∥2− 2F (b) ≤− 2F (143) 39 where (a) is true because of the L1-gradient Lipschitz continuity; (b) is true because u(1) = 0. From (109), we know that SP-GD is always decreasing the appro ximate objective function ˆf . ...

work page
[52]

Deﬁne T ′′ = inf r≥ 1 { r|f (w(r))− f (w(1))≤− 2F }

Case T ′ > T∗: Applying Lemma 9, we know that ∥u(r)− u(1)∥≤ 3S for r < T∗ . Deﬁne T ′′ = inf r≥ 1 { r|f (w(r))− f (w(1))≤− 2F } . Then, after applying Lemma 10, we know T ′′< T∗ . Using the same argument as the above case, for T≥ ˆcT =T ∗ >T ′′, we also have f (x + w(T ))− f (x)−∇ f (x) T w(T )≤ f (w(T ∗ ))− f (x)−∇ f (x) T w(T ∗ ) ≤f (w(T ′′))− f (x)−∇ f...

work page

[1] [1]

Learning the parts of objects by non-negative matrix factorization,

D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999

work page 1999

[2] [2]

Deep learning without poor local minima,

K. Kawaguchi, “Deep learning without poor local minima, ” in Proceedings of Neural Information Pro- cessing Systems (NIPS) , pp. 586–594, 2016

work page 2016

[3] [3]

Theoret ical insights into the optimization landscape of over-parameterized shallow neural networks,

M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoret ical insights into the optimization landscape of over-parameterized shallow neural networks,” IEEE Transactions on Information Theory , vol. 65, pp. 742–769, Feb. 2019

work page 2019

[4] [4]

Escaping from saddle points—online stochastic gradient for tensor decomposition,

R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Proceedings of Annual Conference on Learning Theory (COLT) , pp. 797–842, 2015

work page 2015

[5] [5]

A Geometric Analysis of Phase Retrieval

J. Sun, Q. Qu, and J. Wright, “A geometric analysis of phas e retrieval,” arXiv:1602.06664 [cs.IT] , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

No spurious local minima in no nconvex low rank problems: A uniﬁed geometric analysis,

R. Ge, C. Jin, and Y. Zheng, “No spurious local minima in no nconvex low rank problems: A uniﬁed geometric analysis,” in Proceedings of International Conference on Machine Learning ( ICML), pp. 1233– 1242, 2017

work page 2017

[7] [7]

Complexity analysis of seco nd-order line-search algorithms for smooth nonconvex optimization,

C. W. Royer and S. J. Wright, “Complexity analysis of seco nd-order line-search algorithms for smooth nonconvex optimization,” SIAM Journal on Optimization , vol. 28, no. 2, pp. 1448–1477, 2018

work page 2018

[8] [8]

Accele rated methods for nonconvex optimization,

Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, “Accele rated methods for nonconvex optimization,” SIAM Journal on Optimization , vol. 28, no. 2, pp. 1751–1772, 2018

work page 2018

[9] [9]

Finding approximate local minima faster than gradient descent,

N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma, “Finding approximate local minima faster than gradient descent,” in Proceedings of Annual ACM Symposium on the Theory of Computin g (STOC), pp. 1195–1199, 2017

work page 2017

[10] [10]

A newton-ba sed method for nonconvex optimization with fast evasion of saddle points,

S. Paternain, A. Mokhtari, and A. Ribeiro, “A newton-ba sed method for nonconvex optimization with fast evasion of saddle points,” SIAM Journal on Optimization , vol. 29, no. 1, pp. 343–368, 2019

work page 2019

[11] [11]

A newton-CG al gorithm with complexity guarantees for smooth unconstrained optimization,

C. W. Royer, M. O’Neill, and S. J. Wright, “A newton-CG al gorithm with complexity guarantees for smooth unconstrained optimization,” Mathematical Programming, Jan. 2019

work page 2019

[12] [12]

Gra dient descent only converges to minimizers,

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gra dient descent only converges to minimizers,” in Proceedings of Annual Conference on Learning Theory (COLT) , pp. 1246–1257, 2016

work page 2016

[13] [13]

How to escape saddle points eﬃciently,

C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jorda n, “How to escape saddle points eﬃciently,” in Proceedings of International Conference on Machine Learning ( ICML), pp. 1724–1732, 2017

work page 2017

[14] [14]

First-order stochastic alg orithms for escaping from saddle points in almost linear time,

Y. Xu, J. Rong, and T. Yang, “First-order stochastic alg orithms for escaping from saddle points in almost linear time,” in Proceedings of Neural Information Processing Systems (NeurIPS ), pp. 5535–5545, 2018

work page 2018

[15] [15]

Neon2: Finding local minima via ﬁrst-order oracles,

Z. Allen-Zhu and Y. Li, “Neon2: Finding local minima via ﬁrst-order oracles,” in Proceedings of Neural Information Processing Systems (NeurIPS) , pp. 3720–3730, 2018

work page 2018

[16] [16]

PA-GD: On the convergence of perturbed alternating gradient descent to second-order stationary points for structured nonconvex o ptimization,

S. Lu, M. Hong, and Z. Wang, “PA-GD: On the convergence of perturbed alternating gradient descent to second-order stationary points for structured nonconvex o ptimization,” in Proceedings of International Conference on Machine Learning (ICML) , pp. 4134–4143, 2019

work page 2019

[17] [17]

Using negati ve curvature in solving nonlinear programs,

D. Goldfarb, C. Mu, J. Wright, and C. Zhou, “Using negati ve curvature in solving nonlinear programs,” Computational Optimization and Applications , vol. 68, pp. 479–502, Dec. 2017. 17

work page 2017

[18] [18]

Gradient primal- dual algorithm converges to second-order stationary solutions for nonconvex distributed optimizat ion,

M. Hong, J. D. Lee, and M. Razaviyayn, “Gradient primal- dual algorithm converges to second-order stationary solutions for nonconvex distributed optimizat ion,” in Proceedings of International Conference on Machine Learning (ICML) , 2018

work page 2018

[19] [19]

Learning understandabl e neural networks with nonnegative weight constraints,

J. Chorowski and J. M. Zurada, “Learning understandabl e neural networks with nonnegative weight constraints,” IEEE Transactions on Neural Networks and Learning Systems , vol. 26, pp. 62–69, Jan. 2015

work page 2015

[20] [20]

Tensor decomposition for signal processing and machine learning,

N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E . Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning, ” IEEE Transactions on Signal Processing , vol. 65, no. 13, pp. 3551–3582, 2017

work page 2017

[21] [21]

On nonconvex quadratic pr ogramming with box constraints,

S. Burer and A. N. Letchford, “On nonconvex quadratic pr ogramming with box constraints,” SIAM Journal on Optimization , vol. 20, no. 2, pp. 1073–1089, 2009

work page 2009

[22] [22]

Second-order negative-curvature methods for box-constrained and general constrained optim ization,

R. Andreani, E. G. Birgin, J. M. Martínez, and M. L. Schuv erdt, “Second-order negative-curvature methods for box-constrained and general constrained optim ization,” Computational Optimization and Applications, vol. 45, no. 2, pp. 209–236, 2010

work page 2010

[23] [23]

A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust region methods. SIAM, 2000

work page 2000

[24] [24]

Second-order op timality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear o ptimization,

C. Cartis, N. I. Gould, and P. L. Toint, “Second-order op timality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear o ptimization,” Foundations of Computational Mathematics, vol. 18, no. 5, pp. 1073–1107, 2018

work page 2018

[25] [25]

Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

M. Raginsky, A. Rakhlin, and M. Telgarsky, “Non-convex learning via stochastic gradient langevin dynamics: A nonasymptotic analysis,” arXiv preprint arXiv:1702.03849 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[26] [26]

Convergence to se cond-order stationary points of a primal- dual algorithm model for nonlinear programming,

G. Di Pillo, S. Lucidi, and L. Palagi, “Convergence to se cond-order stationary points of a primal- dual algorithm model for nonlinear programming,” Mathematics of Operations Research , vol. 30, no. 4, pp. 897–915, 2005

work page 2005

[27] [27]

Convergence to second-order stationarity for constrained non-convex optimization,

M. Nouiehed, J. D. Lee, and M. Razaviyayn, “Convergence to second-order stationarity for constrained non-convex optimization,” arXiv preprint arXiv:1810.02024 , 2018

work page arXiv 2018

[28] [28]

Escaping s addle points in constrained optimization,

A. Mokhtari, A. Ozdaglar, and A. Jadbabaie, “Escaping s addle points in constrained optimization,” in Proceedings of Neural Information Processing Systems (NeurIPS ), pp. 3629–3639, 2018

work page 2018

[29] [29]

Information-based complexity of conv ex programming,

A. Nemirovski, “Information-based complexity of conv ex programming,” Lecture Notes, 1995

work page 1995

[30] [30]

D. P. Bertsekas, Nonlinear Programming, 2nd ed . Belmont, MA: Athena Scientiﬁc, 1999

work page 1999

[31] [31]

Some np-complete problems in quadratic and nonlinear programming,

K. G. Murty and S. N. Kabadi, “Some np-complete problems in quadratic and nonlinear programming,” Mathematical programming, vol. 39, no. 2, pp. 117–129, 1987

work page 1987

[32] [32]

Global conver gence of a class of trust region algorithms for optimization with simple bounds,

A. R. Conn, N. I. M. Gould, and P. L. Toint, “Global conver gence of a class of trust region algorithms for optimization with simple bounds,” SIAM Journal on Numerical Analysis , vol. 25, no. 2, pp. 433–460, 1989

work page 1989

[33] [33]

Newton’s method for large boun d-constrained optimization problems,

C.-J. Lin and J. J. Moré, “Newton’s method for large boun d-constrained optimization problems,” SIAM Journal on Optimization , vol. 9, no. 4, pp. 1100–1127, 1999

work page 1999

[34] [34]

Convergence of trust region algorithm s for optimization with bounds when strict com- plementarity does not hold,

M. Lescrenier, “Convergence of trust region algorithm s for optimization with bounds when strict com- plementarity does not hold,” SIAM Journal on Numerical Analysis , vol. 28, no. 2, pp. 476–495, 1991

work page 1991

[35] [35]

A proximal alternating directi on method of multiplier for linearly constrained nonconvex minimization,

J. Zhang and Z.-Q. Luo, “A proximal alternating directi on method of multiplier for linearly constrained nonconvex minimization,” arXiv preprint arXiv:1812.10229v2 , 2019. 18

work page arXiv 2019

[36] [36]

D. P. Bertsekas, Constrained optimization and Lagrange multiplier methods . Academic Press, 2014

work page 2014

[37] [37]

Two-metric projection methods for constrained optimization,

E. M. Gafni and D. P. Bertsekas, “Two-metric projection methods for constrained optimization,” SIAM Journal on Control and Optimization , vol. 22, no. 6, pp. 936–964, 1984

work page 1984

[38] [38]

On the accurat e identiﬁcation of active constraints,

F. Facchinei, A. Fischer, and C. Kanzow, “On the accurat e identiﬁcation of active constraints,” SIAM Journal on Optimization , vol. 9, no. 1, pp. 14–32, 1998

work page 1998

[39] [39]

Convergence to second orde r stationary points in inequality constrained optimization,

F. Facchinei and S. Lucidi, “Convergence to second orde r stationary points in inequality constrained optimization,” Mathematics of Operations Research , vol. 23, no. 3, pp. 746–766, 1998

work page 1998

[40] [40]

A Trust Region Method for Finding Second-Order Stationarity in Linearly Constrained Non-Convex Optimization

M. Nouiehed and M. Razaviyayn, “A trust region method fo r ﬁnding second-order stationarity in linearly constrained non-convex optimization,” arXiv preprint arXiv:1904.06784 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[41] [41]

A database for handwritten text recognitio n research,

J. J. Hull, “A database for handwritten text recognitio n research,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 16, no. 5, pp. 550–554, 1994

work page 1994

[42] [42]

Clustering by ort hogonal non-negative matrix factorization: A sequential non-convex penalty approach,

S. Wang, T. Chang, Y. Cui, and J. Pang, “Clustering by ort hogonal non-negative matrix factorization: A sequential non-convex penalty approach,” in Proceedings of IEEE International Conference on Acoustics Speech and Signal Process. (ICASSP) , pp. 5576–5580, May 2019. 19 A Proofs Related to Stationary Points A.1 Proof of Proposition 2 Proof. When f (x) = ...

work page 2019

[43] [43]

(A′(x(r))d(r))i≤ 0: any α> 0 can satisfy (63)

work page

[44] [44]

That is, going along the current direction far enough will eventually reach the boundary of t he feasible set

(A′(x(r))d(r))i > 0: we need α≤ (b′− A′(x(r))x(r))i/ (A′(x(r))d(r))i. That is, going along the current direction far enough will eventually reach the boundary of t he feasible set. Then it follows if there exists a ﬁnite step-size α so that (A′(x(r))(x(r) +α d(r)))i = b′ i, (64) we can easily compute α (r) max in the closed-form by (26). 25 C Proofs of SN...

work page

[45] [45]

This completes the proof

since in line 8 of Algorithm 1 we know from the algorithm that−qπ (x(r)) is chosen when −qπ (x(r)) T v(x(r))L1ǫ′ H (δ) L2 + 63L1ǫ′3 H (δ) 128L2 2 ≤∥ qπ (x(r))∥2, q π (x(r)) T v(x(r))≤ 0. This completes the proof. Lemma 6. If d(r) is chosen by v(x(r)), x(r+1) is computed by the NCD procedure in Algorithm 1 and α (r) max is not selected, then the line searc...

work page

[46] [46]

06ǫ′3 H(δ) L2 2 , ǫ2 G 18L1 } 1 min{d,m}, (84) we obtain r≤ f (x(1))− f ⋆ ∆′ . (85) Since the probability that eigen-pair fails to extract the n egative curvature isδ, applying the union bound, we only need to set δ′ =δ(f (x(1)− f ⋆)/ ∆′ so that we can have the claim that SNAP will output approximat e SOSP1s with probability 1− δ′. Note that γǫ′ H(δ)>ǫ H....

work page

[47] [47]

Then we must have f (x(r))− f (x(r− min{d,m }))≤− min { ǫ2 G/ (18L1), 0

or the PGD step, otherwise the algorithm will stop and output an (ǫG,ǫ H )-SOSP1 solution. Then we must have f (x(r))− f (x(r− min{d,m }))≤− min { ǫ2 G/ (18L1), 0. 06ǫ′3 H(δ)/L 2 2 } ,∀r> min{d,m}. (90) Summarizing the argument so far in the second case, we have th at, after every consecutive min{d,m} iterations of the algorithm, either the algorithms sto...

work page

[48] [48]

∥gπ (x(r))∥≥ ǫG: descent per-iteration is ǫ2 G 18L1 by Lemma 4

work page

[49] [49]

06ǫ′3 H(δ)/ (rthL2

∥gπ (x(r))∥≤ ǫG 32 • ﬂagα =∅ and r− rlast <r th: descent per-iteration: 0. 06ǫ′3 H(δ)/ (rthL2

work page

[50] [50]

• ﬂagα = ♦ or r− rlast≥ rth: – ﬂag =∅: This case means that x(r) is an (ǫG,ǫ H)-SOSP1

by (89). • ﬂagα = ♦ or r− rlast≥ rth: – ﬂag =∅: This case means that x(r) is an (ǫG,ǫ H)-SOSP1. We output x(r). – ﬂag = ♦ (there is a negative curvature): ∗ ﬂagα =∅ (boundary not touched): descent per-iteration is at least 0. 06 ǫ′3 H(δ) L2 2 by (88). ∗ ﬂagα = ♦ (boundary touched): descent per-iteration min{ ǫ2 G 18L1 , 0. 06 ǫ′3 H (δ) L2 2 }/ min{d,m} by...

work page

[51] [51]

From (109), we know that SP-GD is always decreasing the appro ximate objective function ˆf

Case T ′≤ T ∗ : Applying Lemma 9, we know that f (x + u(T ′))− f (x)−∇ f (x) T u(T ′)≤ f (x + u(1))− f (x)−∇ f (x) T u(1)− 2F (a) ≤ L1 2∥u(1)∥2− 2F (b) ≤− 2F (143) 39 where (a) is true because of the L1-gradient Lipschitz continuity; (b) is true because u(1) = 0. From (109), we know that SP-GD is always decreasing the appro ximate objective function ˆf . ...

work page

[52] [52]

Deﬁne T ′′ = inf r≥ 1 { r|f (w(r))− f (w(1))≤− 2F }

Case T ′ > T∗: Applying Lemma 9, we know that ∥u(r)− u(1)∥≤ 3S for r < T∗ . Deﬁne T ′′ = inf r≥ 1 { r|f (w(r))− f (w(1))≤− 2F } . Then, after applying Lemma 10, we know T ′′< T∗ . Using the same argument as the above case, for T≥ ˆcT =T ∗ >T ′′, we also have f (x + w(T ))− f (x)−∇ f (x) T w(T )≤ f (w(T ∗ ))− f (x)−∇ f (x) T w(T ∗ ) ≤f (w(T ′′))− f (x)−∇ f...

work page