Distributed Learning in Non-Convex Environments -- Part II: Polynomial Escape from Saddle-Points

Ali H. Sayed; Stefan Vlaski

The diffusion strategy escapes strict saddle-points in O(1/μ) iterations and reaches approximate second-order stationary points in polynomial time under milder gradient noise conditions.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-25 09:48 UTC pith:UQIJ2CZL

load-bearing objection The paper claims O(1/μ) saddle escape for diffusion strategies under relaxed noise but the bound rests on the short-term model from Part I remaining accurate near saddles. the 1 major comments →

arxiv 1907.01849 v1 pith:UQIJ2CZL submitted 2019-07-03 cs.MA cs.LGeess.SPmath.OC

Distributed Learning in Non-Convex Environments -- Part II: Polynomial Escape from Saddle-Points

Stefan Vlaski , Ali H. Sayed This is my paper

classification cs.MA cs.LGeess.SPmath.OC

keywords diffusion strategydistributed learningnon-convex optimizationsaddle-point escapesecond-order stationary pointsstochastic gradientnetwork centroidfinite-time analysis

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that agents using the diffusion strategy in a distributed network can escape strict saddle-points during non-convex optimization. Building on the earlier result that agents cluster around a network centroid, the authors apply a short-term model to track the centroid's finite-time behavior. They show escape occurs in O(1/μ) steps and that the process reaches points satisfying approximate second-order stationarity after a polynomial number of iterations. The result holds with weaker assumptions on the gradient noise than those required in centralized perturbed or stochastic gradient methods. A reader would care because it indicates that distributed streaming-data learning can navigate non-convex landscapes efficiently without centralized coordination.

Core claim

Using the short-term model for finite-time dynamics, the diffusion strategy is able to escape from strict saddle-points in O(1/μ) iterations; it is also able to return approximately second-order stationary points in a polynomial number of iterations. Relative to prior works on the polynomial escape from saddle-points, most of which focus on centralized perturbed or stochastic gradient descent, the approach requires less restrictive conditions on the gradient noise process.

What carries the argument

The short-term model for the finite-time dynamics of the network centroid, which tracks expected descent in the large-gradient regime and enables analysis of escape behavior.

Load-bearing premise

The short-term model accurately describes the network centroid behavior over finite time horizons.

What would settle it

A direct calculation or simulation of the network centroid trajectory showing that escape from a strict saddle-point requires more than O(1/μ) iterations or stronger noise assumptions than those used in the short-term model.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

The diffusion strategy achieves escape from strict saddle-points in linear time relative to the inverse step-size.
Approximate second-order stationary points are reached after a polynomial number of iterations across the network.
The escape result holds under gradient noise conditions that are less restrictive than those needed for centralized stochastic gradient methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The clustering around the centroid combined with the escape result implies that the overall network behavior can be reduced to analyzing a single effective trajectory for saddle avoidance.
The milder noise requirements suggest the distributed exchange of iterates may supply sufficient perturbation for escape without added artificial noise.
The polynomial-time guarantee opens the possibility of applying the same short-term modeling technique to other distributed recursions that exhibit similar centroid dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper claims O(1/μ) saddle escape for diffusion strategies under relaxed noise but the bound rests on the short-term model from Part I remaining accurate near saddles.

read the letter

The key point here is that the diffusion strategy is claimed to escape strict saddle-points in O(1/μ) iterations and reach approximate second-order points in polynomial iterations, with less restrictive noise conditions than centralized methods. The paper does a reasonable job of building on Part I by using the established clustering around the centroid and the short-term model to analyze finite-time dynamics in non-convex settings. Applying known centralized ideas to the distributed case is a natural step, and relaxing the noise assumptions could be useful if the proofs hold. The main concern is whether the short-term model from Part I remains valid during the saddle-escape phase. The escape bound is derived from that model applied to the centroid, so any deviation in the actual dynamics near saddles would undermine the O(1/μ) claim and the polynomial return time. The abstract doesn't provide the error bounds or explicit conditions, so it's difficult to judge how tight the analysis is without the full derivations. This work is aimed at people studying distributed stochastic optimization and non-convex learning over networks. Someone already familiar with the diffusion strategy and Part I would get the most out of it. The central argument could hold if the model checks out, but the dependence on the prior part makes it worth checking carefully. I would recommend sending this to peer review so that referees can examine the derivations and the validity of the short-term approximation in the saddle regime.

Referee Report

1 major / 0 minor

Summary. The paper claims that the diffusion strategy escapes strict saddle-points in O(1/μ) iterations and reaches approximate second-order stationary points in a polynomial number of iterations by applying the short-term model for the network centroid (introduced in Part I) to the finite-time dynamics, under less restrictive conditions on the gradient noise process than prior centralized perturbed or stochastic gradient descent analyses.

Significance. If the short-term model remains faithful near saddles, the result would provide a distributed counterpart to centralized polynomial escape-time guarantees with weaker noise assumptions, which is potentially significant for non-convex distributed optimization over networks.

major comments (1)

[Abstract (and reliance on Part I short-term model)] The O(1/μ) escape-time and polynomial second-order stationarity claims rest entirely on the short-term model from Part I accurately describing the centroid trajectory over the relevant finite-time windows near strict saddles (including under the stated noise conditions); no explicit verification, error bounds, or additional conditions for the saddle regime are provided in this manuscript, making the central claims unverifiable from the given material.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for identifying the central reliance on the short-term model from Part I. We address the major comment below.

read point-by-point responses

Referee: [Abstract (and reliance on Part I short-term model)] The O(1/μ) escape-time and polynomial second-order stationarity claims rest entirely on the short-term model from Part I accurately describing the centroid trajectory over the relevant finite-time windows near strict saddles (including under the stated noise conditions); no explicit verification, error bounds, or additional conditions for the saddle regime are provided in this manuscript, making the central claims unverifiable from the given material.

Authors: We agree that the O(1/μ) escape-time and polynomial second-order stationarity results in this manuscript are obtained by applying the short-term model for the network centroid that was introduced and analyzed in Part I. That model was derived under the paper's stated gradient noise assumptions for general finite-time windows in non-convex environments; the present analysis invokes it specifically over the short windows in which the centroid trajectory is near a strict saddle. Because the noise conditions and finite-horizon setting are identical to those under which the model was justified in Part I, the same error bounds apply. Nevertheless, the referee is correct that this manuscript does not repeat or re-derive those bounds. In the revised version we will insert a short paragraph (new Section II-B or an expanded introduction) that recalls the key properties and error bounds of the short-term model from Part I and states explicitly why they remain valid in the saddle regime. This will make the central claims verifiable from the material of Part II alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies prior Part I model without reduction to self-inputs

full rationale

The paper's escape-time claims are derived by applying the short-term model and clustering results from the authors' prior Part I work. The abstract explicitly states the new results are obtained 'using this model,' but provides no equations or steps showing that the O(1/μ) bound or polynomial stationarity return is equivalent by construction to quantities already defined or fitted in Part II itself. Self-citation to Part I is standard and does not meet the criteria for load-bearing circularity, as the cited model is presented as an independent prior derivation rather than a tautology or fitted prediction renamed within this manuscript. No self-definitional, fitted-input, or ansatz-smuggling patterns are exhibited by direct quotes from the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5679 in / 1149 out tokens · 23539 ms · 2026-05-25T09:48:56.950886+00:00 · methodology

0 comments

read the original abstract

The diffusion strategy for distributed learning from streaming data employs local stochastic gradient updates along with exchange of iterates over neighborhoods. In Part I [2] of this work we established that agents cluster around a network centroid and proceeded to study the dynamics of this point. We established expected descent in non-convex environments in the large-gradient regime and introduced a short-term model to examine the dynamics over finite-time horizons. Using this model, we establish in this work that the diffusion strategy is able to escape from strict saddle-points in O(1/$\mu$) iterations; it is also able to return approximately second-order stationary points in a polynomial number of iterations. Relative to prior works on the polynomial escape from saddle-points, most of which focus on centralized perturbed or stochastic gradient descent, our approach requires less restrictive conditions on the gradient noise process.

Figures

Figures reproduced from arXiv: 1907.01849 by Ali H. Sayed, Stefan Vlaski.

**Figure 2.** Figure 2: Cost surface of a simple neural network with [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 4.** Figure 4: Agents are initialized together precisely in the str [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 5 internal anchors

[1]

Diffusion learning in non-con vex envi- ronments,

S. Vlaski and A. H. Sayed, “Diffusion learning in non-con vex envi- ronments,” in Proc. of IEEE ICASSP , Brighton, UK, May 2019, pp. 5262–5266

work page 2019
[2]

Distributed learning in non-c onvex environments – Part I: Agreement at a Linear rate,

S. Vlaski and A. H. Sayed, “Distributed learning in non-c onvex environments – Part I: Agreement at a Linear rate,” submitted for publication, see also arXiv version , July 2019

work page 2019
[3]

Distributed subgradient meth ods for multi- agent optimization,

A. Nedic and A. Ozdaglar, “Distributed subgradient meth ods for multi- agent optimization,” IEEE Trans. Automatic Control , vol. 54, no. 1, pp. 48–61, Jan 2009

work page 2009
[4]

Adaptation, learning, and optimization ov er networks,

A. H. Sayed, “Adaptation, learning, and optimization ov er networks,” F oundations and Trends in Machine Learning , vol. 7, no. 4-5, pp. 311– 801, July 2014

work page 2014
[5]

On the learning behavior of adapt ive networks - Part I: Transient analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adapt ive networks - Part I: Transient analysis,” IEEE Transactions on Information Theory , vol. 61, no. 6, pp. 3487–3517, June 2015

work page 2015
[6]

Distributed stochastic optimization with gradient tracking over strongly-connected networks

R. Xin, A. K. Sahu, U. A. Khan, S. Kar, “Distributed stocha stic optimization with gradient tracking over strongly-connec ted networks,” available as arXiv:1903.07266 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903
[7]

Stochastic g radient descent with ﬁnite samples sizes,

K. Y uan, B. Ying, S. Vlaski, and A. H. Sayed, “Stochastic g radient descent with ﬁnite samples sizes,” in Proc. of IEEE MLSP , Vietri sul Mare, Italy, Sep. 2016, pp. 1–6

work page 2016
[8]

Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

work page 2015
[9]

Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,

K. Y uan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,” IEEE Transactions on Signal Processing , vol. 67, no. 3, pp. 724–739, Feb 2019

work page 2019
[10]

Diffusion strategies o utperform consensus strategies for distributed estimation over adap tive networks,

Sheng-Y uan Tu and Ali H. Sayed, “Diffusion strategies o utperform consensus strategies for distributed estimation over adap tive networks,” Trans. Sig. Proc. , vol. 60, no. 12, pp. 6217–6234, Dec. 2012

work page 2012
[11]

On the learning behavior of adap tive networks – Part II: Performance analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adap tive networks – Part II: Performance analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3518–3548, June 2015

work page 2015
[12]

Prox imal multitask learning over networks with sparsity-inducing coregulari zation,

R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, “Prox imal multitask learning over networks with sparsity-inducing coregulari zation,” IEEE Transactions on Signal Processing , vol. 64, no. 23, pp. 6329–6344, Dec 2016

work page 2016
[13]

Adaptive penalty-based dis tributed stochastic convex optimization,

Z. J. Towﬁc and A. H. Sayed, “Adaptive penalty-based dis tributed stochastic convex optimization,” IEEE Trans. on Signal Process. , vol. 62, no. 15, pp. 3924–3938, Aug. 2014

work page 2014
[14]

Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,

B. Ying and A. H. Sayed, “Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,” Signal Processing, vol. 144, pp. 253 – 264, 2018

work page 2018
[15]

Cubic regularization of n ewton method and its global performance,

Y . Nesterov and B.T. Polyak, “Cubic regularization of n ewton method and its global performance,” Mathematical Programming, vol. 108, no. 1, pp. 177–205, Aug 2006

work page 2006
[16]

SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,

C. Fang, C. J. Li, Z. Lin, and T. Zhang, “SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,” in Proc. of NIPS , pp. 689–699. Montreal, Canada, 2018

work page 2018
[17]

NEON2: Finding local minima via ﬁrst-order oracles,

Z. Allen-Zhu and Y . Li, “NEON2: Finding local minima via ﬁrst-order oracles,” in Proc. of NIPS , pp. 3716–3726. Montreal, Canada, Dec. 2018

work page 2018
[18]

Natasha 2: Faster non-convex optimizat ion than SGD,

Z. Allen-Zhu, “Natasha 2: Faster non-convex optimizat ion than SGD,” in Proc. of NIPS , pp. 2675–2686. Montreal, Canada, Dec. 2018

work page 2018
[19]

Gra dient descent only converges to minimizers,

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gra dient descent only converges to minimizers,” in 29th Annual Conference on Learning Theory, New Y ork, 2016, pp. 1246–1257

work page 2016
[20]

Second-o rder guaran- tees of distributed gradient algorithms,

A. Daneshmand, G. Scutari and V . Kungurtsev, “Second-o rder guaran- tees of distributed gradient algorithms,” available as arXiv:1809.08694 , Sep. 2018

work page arXiv 2018
[21]

Recursive stochastic algori thms for global optimization in Rd,

S. Gelfand and S. Mitter, “Recursive stochastic algori thms for global optimization in Rd,” SIAM Journal on Control and Optimization , vol. 29, no. 5, pp. 999–1018, 1991

work page 1991
[22]

Annealing for Distributed Global Optimization

B. Swenson, S. Kar, H. V . Poor and J. M. F. Moura, “Anneali ng for distributed global optimization,” available as arXiv:1903.07258 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903
[23]

Gradient Descent Can Take Exponential Time to Escape Saddle Points

S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, B. Poczos and A. Si ngh, “Gradient descent can take exponential time to escape saddl e points,” available as arXiv:1705.10412 , May 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[24]

Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,

R. Ge, F. Huang, C. Jin, and Y . Y uan, “Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,” in Proc. of Conference on Learning Theory , Paris, France, 2015, pp. 797–842

work page 2015
[25]

How to escape saddle points efﬁciently,

C. Jin, R. Ge, P . Netrapalli, S. M. Kakade, and M. I. Jorda n, “How to escape saddle points efﬁciently,” in Proc. of ICML , Sydney, Australia, Aug. 2017, pp. 1724–1732

work page 2017
[26]

M.andJordan, M

C. Jin, P . Netrapalli, R. Ge, S. M. Kakade and M. I. Jordan , “Stochas- tic gradient descent escapes saddle points efﬁciently,” available as arXiv:1902.04811, Feb. 2019

work page arXiv 1902
[27]

Escaping Saddles with Stochastic Gradients

H. Daneshmand, J. Kohler, A. Lucchi and T. Hofmann, “Esc aping saddles with stochastic gradients,” available as arXiv:1803.05999 , March 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

C. Fang, Z. Lin and T. Zhang, “Sharp analysis for nonconv ex sgd escaping from saddle points,” available as arXiv:1902.00247, Feb. 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[29]

NEXT: in-network nonconv ex optimiza- tion,

P . Di Lorenzo and G. Scutari, “NEXT: in-network nonconv ex optimiza- tion,” IEEE Transactions on Signal and Information Processing ove r Networks, vol. 2, no. 2, pp. 120–136, June 2016

work page 2016
[30]

Global convergence of ADMM in nonconvex nonsmooth optimization,

Y . Wang, W. Yin, and J. Zeng, “Global convergence of ADMM in nonconvex nonsmooth optimization,” Journal of Scientiﬁc Computing , vol. 78, no. 1, pp. 29–63, Jan. 2019

work page 2019
[31]

Non-convex distributed opt imization,

T. Tatarenko and B. Touri, “Non-convex distributed opt imization,” IEEE Transactions on Automatic Control , vol. 62, no. 8, pp. 3744–3757, Aug. 2017

work page 2017
[32]

R. A. Horn and C. R. Johnson, Matrix Analysis , Cambridge University Press, 2003

work page 2003
[33]

The Perron-Frobenius theorem: Some of its applications,

S. U. Pillai, T. Suel, and S. Cha, “The Perron-Frobenius theorem: Some of its applications,” IEEE Signal Processing Magazine , vol. 22, no. 2, pp. 62–75, March 2005

work page 2005

[1] [1]

Diffusion learning in non-con vex envi- ronments,

S. Vlaski and A. H. Sayed, “Diffusion learning in non-con vex envi- ronments,” in Proc. of IEEE ICASSP , Brighton, UK, May 2019, pp. 5262–5266

work page 2019

[2] [2]

Distributed learning in non-c onvex environments – Part I: Agreement at a Linear rate,

S. Vlaski and A. H. Sayed, “Distributed learning in non-c onvex environments – Part I: Agreement at a Linear rate,” submitted for publication, see also arXiv version , July 2019

work page 2019

[3] [3]

Distributed subgradient meth ods for multi- agent optimization,

A. Nedic and A. Ozdaglar, “Distributed subgradient meth ods for multi- agent optimization,” IEEE Trans. Automatic Control , vol. 54, no. 1, pp. 48–61, Jan 2009

work page 2009

[4] [4]

Adaptation, learning, and optimization ov er networks,

A. H. Sayed, “Adaptation, learning, and optimization ov er networks,” F oundations and Trends in Machine Learning , vol. 7, no. 4-5, pp. 311– 801, July 2014

work page 2014

[5] [5]

On the learning behavior of adapt ive networks - Part I: Transient analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adapt ive networks - Part I: Transient analysis,” IEEE Transactions on Information Theory , vol. 61, no. 6, pp. 3487–3517, June 2015

work page 2015

[6] [6]

Distributed stochastic optimization with gradient tracking over strongly-connected networks

R. Xin, A. K. Sahu, U. A. Khan, S. Kar, “Distributed stocha stic optimization with gradient tracking over strongly-connec ted networks,” available as arXiv:1903.07266 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903

[7] [7]

Stochastic g radient descent with ﬁnite samples sizes,

K. Y uan, B. Ying, S. Vlaski, and A. H. Sayed, “Stochastic g radient descent with ﬁnite samples sizes,” in Proc. of IEEE MLSP , Vietri sul Mare, Italy, Sep. 2016, pp. 1–6

work page 2016

[8] [8]

Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

work page 2015

[9] [9]

Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,

K. Y uan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,” IEEE Transactions on Signal Processing , vol. 67, no. 3, pp. 724–739, Feb 2019

work page 2019

[10] [10]

Diffusion strategies o utperform consensus strategies for distributed estimation over adap tive networks,

Sheng-Y uan Tu and Ali H. Sayed, “Diffusion strategies o utperform consensus strategies for distributed estimation over adap tive networks,” Trans. Sig. Proc. , vol. 60, no. 12, pp. 6217–6234, Dec. 2012

work page 2012

[11] [11]

On the learning behavior of adap tive networks – Part II: Performance analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adap tive networks – Part II: Performance analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3518–3548, June 2015

work page 2015

[12] [12]

Prox imal multitask learning over networks with sparsity-inducing coregulari zation,

R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, “Prox imal multitask learning over networks with sparsity-inducing coregulari zation,” IEEE Transactions on Signal Processing , vol. 64, no. 23, pp. 6329–6344, Dec 2016

work page 2016

[13] [13]

Adaptive penalty-based dis tributed stochastic convex optimization,

Z. J. Towﬁc and A. H. Sayed, “Adaptive penalty-based dis tributed stochastic convex optimization,” IEEE Trans. on Signal Process. , vol. 62, no. 15, pp. 3924–3938, Aug. 2014

work page 2014

[14] [14]

Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,

B. Ying and A. H. Sayed, “Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,” Signal Processing, vol. 144, pp. 253 – 264, 2018

work page 2018

[15] [15]

Cubic regularization of n ewton method and its global performance,

Y . Nesterov and B.T. Polyak, “Cubic regularization of n ewton method and its global performance,” Mathematical Programming, vol. 108, no. 1, pp. 177–205, Aug 2006

work page 2006

[16] [16]

SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,

C. Fang, C. J. Li, Z. Lin, and T. Zhang, “SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,” in Proc. of NIPS , pp. 689–699. Montreal, Canada, 2018

work page 2018

[17] [17]

NEON2: Finding local minima via ﬁrst-order oracles,

Z. Allen-Zhu and Y . Li, “NEON2: Finding local minima via ﬁrst-order oracles,” in Proc. of NIPS , pp. 3716–3726. Montreal, Canada, Dec. 2018

work page 2018

[18] [18]

Natasha 2: Faster non-convex optimizat ion than SGD,

Z. Allen-Zhu, “Natasha 2: Faster non-convex optimizat ion than SGD,” in Proc. of NIPS , pp. 2675–2686. Montreal, Canada, Dec. 2018

work page 2018

[19] [19]

Gra dient descent only converges to minimizers,

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gra dient descent only converges to minimizers,” in 29th Annual Conference on Learning Theory, New Y ork, 2016, pp. 1246–1257

work page 2016

[20] [20]

Second-o rder guaran- tees of distributed gradient algorithms,

A. Daneshmand, G. Scutari and V . Kungurtsev, “Second-o rder guaran- tees of distributed gradient algorithms,” available as arXiv:1809.08694 , Sep. 2018

work page arXiv 2018

[21] [21]

Recursive stochastic algori thms for global optimization in Rd,

S. Gelfand and S. Mitter, “Recursive stochastic algori thms for global optimization in Rd,” SIAM Journal on Control and Optimization , vol. 29, no. 5, pp. 999–1018, 1991

work page 1991

[22] [22]

Annealing for Distributed Global Optimization

B. Swenson, S. Kar, H. V . Poor and J. M. F. Moura, “Anneali ng for distributed global optimization,” available as arXiv:1903.07258 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903

[23] [23]

Gradient Descent Can Take Exponential Time to Escape Saddle Points

S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, B. Poczos and A. Si ngh, “Gradient descent can take exponential time to escape saddl e points,” available as arXiv:1705.10412 , May 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[24] [24]

Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,

R. Ge, F. Huang, C. Jin, and Y . Y uan, “Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,” in Proc. of Conference on Learning Theory , Paris, France, 2015, pp. 797–842

work page 2015

[25] [25]

How to escape saddle points efﬁciently,

C. Jin, R. Ge, P . Netrapalli, S. M. Kakade, and M. I. Jorda n, “How to escape saddle points efﬁciently,” in Proc. of ICML , Sydney, Australia, Aug. 2017, pp. 1724–1732

work page 2017

[26] [26]

M.andJordan, M

C. Jin, P . Netrapalli, R. Ge, S. M. Kakade and M. I. Jordan , “Stochas- tic gradient descent escapes saddle points efﬁciently,” available as arXiv:1902.04811, Feb. 2019

work page arXiv 1902

[27] [27]

Escaping Saddles with Stochastic Gradients

H. Daneshmand, J. Kohler, A. Lucchi and T. Hofmann, “Esc aping saddles with stochastic gradients,” available as arXiv:1803.05999 , March 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[28] [28]

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

C. Fang, Z. Lin and T. Zhang, “Sharp analysis for nonconv ex sgd escaping from saddle points,” available as arXiv:1902.00247, Feb. 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[29] [29]

NEXT: in-network nonconv ex optimiza- tion,

P . Di Lorenzo and G. Scutari, “NEXT: in-network nonconv ex optimiza- tion,” IEEE Transactions on Signal and Information Processing ove r Networks, vol. 2, no. 2, pp. 120–136, June 2016

work page 2016

[30] [30]

Global convergence of ADMM in nonconvex nonsmooth optimization,

Y . Wang, W. Yin, and J. Zeng, “Global convergence of ADMM in nonconvex nonsmooth optimization,” Journal of Scientiﬁc Computing , vol. 78, no. 1, pp. 29–63, Jan. 2019

work page 2019

[31] [31]

Non-convex distributed opt imization,

T. Tatarenko and B. Touri, “Non-convex distributed opt imization,” IEEE Transactions on Automatic Control , vol. 62, no. 8, pp. 3744–3757, Aug. 2017

work page 2017

[32] [32]

R. A. Horn and C. R. Johnson, Matrix Analysis , Cambridge University Press, 2003

work page 2003

[33] [33]

The Perron-Frobenius theorem: Some of its applications,

S. U. Pillai, T. Suel, and S. Cha, “The Perron-Frobenius theorem: Some of its applications,” IEEE Signal Processing Magazine , vol. 22, no. 2, pp. 62–75, March 2005

work page 2005