Distributed Learning in Non-Convex Environments -- Part I: Agreement at a Linear Rate

Ali H. Sayed; Stefan Vlaski

arxiv: 1907.01848 · v1 · pith:GHXNPK2Jnew · submitted 2019-07-03 · 🧮 math.OC · cs.MA· eess.SP

Distributed Learning in Non-Convex Environments -- Part I: Agreement at a Linear Rate

Stefan Vlaski , Ali H. Sayed This is my paper

Pith reviewed 2026-05-25 10:21 UTC · model grok-4.3

classification 🧮 math.OC cs.MAeess.SP

keywords distributed optimizationnon-convex optimizationstochastic gradientsdiffusion learningmulti-agent systemsnetwork agreementlinear convergence rate

0 comments

The pith

In stochastic non-convex distributed optimization, diffusion strategies drive agent iterates to cluster around the network centroid at a linear rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies diffusion learning for distributed optimization when cost functions are non-convex and only stochastic gradient approximations are available. It establishes that the persistent gradient noise does not prevent the agents from reaching agreement: their iterates gather inside a small ball around the network centroid. The clustering occurs at a linear rate, which supplies a short-term model for finite-horizon network behavior. This agreement result is then used in the companion Part II to analyze escape from saddle points. A sympathetic reader cares because many practical learning tasks are non-convex and must run on networks where only noisy local gradients can be computed.

Core claim

The diffusion learning strategy continues to yield meaningful estimates in non-convex scenarios in the sense that the iterates by the individual agents will cluster in a small region around the network centroid.

What carries the argument

The diffusion learning strategy, which performs local stochastic gradient steps followed by convex combination of neighbor estimates, producing agreement around the network centroid.

If this is right

The observed clustering supplies a short-term model for network evolution over any finite horizon.
This model can be leveraged to prove descent through saddle points in O(1/mu) steps.
Approximately second-order stationary points are reached after a polynomial number of iterations.
The same clustering mechanism remains valid when the underlying costs are non-convex.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The linear clustering rate may allow similar agreement results for other combination policies beyond diffusion.
The finite-horizon model could be tested on concrete non-convex problems such as training shallow neural networks over sensor networks.
If the clustering radius scales with the gradient noise variance, then reducing local step-size or increasing communication frequency would tighten the ball.

Load-bearing premise

The diffusion learning strategy continues to yield meaningful estimates in non-convex scenarios even though exact gradients are replaced by stochastic approximations whose noise persistently affects the dynamics.

What would settle it

Numerical runs in which the distance of individual iterates from the network centroid remains larger than the predicted small radius after a number of steps proportional to log(1/epsilon) would falsify the linear-rate clustering claim.

Figures

Figures reproduced from arXiv: 1907.01848 by Ali H. Sayed, Stefan Vlaski.

**Figure 2.** Figure 2: Graph with N = 20 nodes (left) and regressor power Tr Rh,k at each agent (right). 000 0 00 0 00 0 00 ! ! 0 % 0 % 0 %0 %0 %0 %0 %0 0 0 !# " !# " " " !#"$ " "$ 000 0 00 0 00 0 00 0 $0 $0 $0 $0 $0 0 0 [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Performance in the nominal (left) and corrupted case [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Driven by the need to solve increasingly complex optimization problems in signal processing and machine learning, there has been increasing interest in understanding the behavior of gradient-descent algorithms in non-convex environments. Most available works on distributed non-convex optimization problems focus on the deterministic setting where exact gradients are available at each agent. In this work and its Part II, we consider stochastic cost functions, where exact gradients are replaced by stochastic approximations and the resulting gradient noise persistently seeps into the dynamics of the algorithm. We establish that the diffusion learning strategy continues to yield meaningful estimates non-convex scenarios in the sense that the iterates by the individual agents will cluster in a small region around the network centroid. We use this insight to motivate a short-term model for network evolution over a finite-horizon. In Part II [2] of this work, we leverage this model to establish descent of the diffusion strategy through saddle points in O(1/$\mu$) steps and the return of approximately second-order stationary points in a polynomial number of iterations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows diffusion iterates cluster around the centroid at linear rate in the stochastic non-convex case, which cleanly separates consensus from optimization.

read the letter

The main takeaway is that the diffusion strategy still produces linear-rate agreement among agents even when stochastic gradient noise is present and the costs are non-convex. The iterates cluster in a small ball around the network centroid, and this holds without the noise destroying the contraction from the combination matrix spectral gap. That separation is the useful piece for the short-term model they introduce here and then use in Part II for saddle escape analysis. The extension from their prior deterministic or convex results is incremental but fills a needed gap for stochastic non-convex distributed work. The argument follows the usual perturbation approach: the disagreement recursion contracts for small enough step-size mu, with the additive noise term only setting the ultimate cluster radius rather than the rate itself. This structure is consistent and avoids circularity. One minor limitation is that the cluster size scales with the noise variance and mu, so it can be larger than desired in high-variance regimes, but that is expected and does not undermine the linear-rate claim. The assumptions on Lipschitz gradients and bounded noise moments are standard for this line of work. This paper is aimed at researchers who need rigorous building blocks for convergence theory in distributed non-convex optimization and large-scale machine learning. A reader working on multi-agent training or diffusion-based methods would find the clustering result and the finite-horizon model directly usable. It deserves peer review because the core technical step is solid and relevant to an active area.

Referee Report

0 major / 2 minor

Summary. This paper (Part I) analyzes diffusion strategies for distributed stochastic optimization over networks in non-convex settings. It claims that, despite persistent gradient noise from stochastic approximations, the individual agent iterates cluster in a small region around the network centroid at a linear rate. The clustering result is used to motivate a short-term model for finite-horizon network evolution, which Part II leverages to analyze descent through saddle points in O(1/μ) steps and return of approximate second-order stationary points.

Significance. If the linear-rate clustering holds under standard assumptions on the combination matrix and step-size, the result is significant because it cleanly separates consensus dynamics (governed by the spectral gap) from optimization dynamics even with additive stochastic driving terms. This decoupling is a standard but useful technical step that enables the non-convex analysis promised in Part II; the linear contraction rate itself is noteworthy as it is shown to be unaffected by convexity or the noise variance (which only influences cluster radius).

minor comments (2)

[Abstract] Abstract, paragraph 3: the phrase 'the diffusion learning strategy continues to yield meaningful estimates' is vague; replace with a precise statement of the clustering radius (e.g., O(μ) or O(√μ) in mean-square) and the linear rate constant.
The short-term model is motivated from the clustering result, but the transition from infinite-horizon clustering to the finite-horizon model should be stated more explicitly (e.g., which moments or bounds are retained).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the recognition of the significance of the linear-rate clustering result, and the recommendation for minor revision. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The derivation establishes linear-rate clustering of diffusion iterates around the network centroid in non-convex stochastic settings by separating consensus contraction (via spectral gap of the combination matrix) from optimization dynamics, with noise affecting only cluster radius. This separation is standard and independent of the target result; the short-term model is motivated by the clustering claim but does not reduce to it by construction or via self-citation chains. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The paper is self-contained against external benchmarks for this part.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5713 in / 904 out tokens · 31896 ms · 2026-05-25T10:21:50.094796+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 5 internal anchors

[1]

Diffusion learning in non-con vex envi- ronments,

S. Vlaski and A. H. Sayed, “Diffusion learning in non-con vex envi- ronments,” in Proc. of IEEE ICASSP , Brighton, UK, May 2019, pp. 5262–5266

work page 2019
[2]

Distributed learning in non-c onvex envi- ronments – Part II: Polynomial escape from saddle-points,

S. Vlaski and A. H. Sayed, “Distributed learning in non-c onvex envi- ronments – Part II: Polynomial escape from saddle-points,” submitted for publication, see also arXiv version , July 2019

work page 2019
[3]

Distributed subgradient meth ods for multi- agent optimization,

A. Nedic and A. Ozdaglar, “Distributed subgradient meth ods for multi- agent optimization,” IEEE Trans. Automatic Control , vol. 54, no. 1, pp. 48–61, Jan 2009

work page 2009
[4]

Adaptation, learning, and optimization ov er networks,

A. H. Sayed, “Adaptation, learning, and optimization ov er networks,” F oundations and Trends in Machine Learning , vol. 7, no. 4-5, pp. 311– 801, July 2014

work page 2014
[5]

On the learning behavior of adapt ive networks - Part I: Transient analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adapt ive networks - Part I: Transient analysis,” IEEE Transactions on Information Theory , vol. 61, no. 6, pp. 3487–3517, June 2015

work page 2015
[6]

Distributed stochastic optimization with gradient tracking over strongly-connected networks

R. Xin, A. K. Sahu, U. A. Khan, S. Kar, “Distributed stocha stic optimization with gradient tracking over strongly-connec ted networks,” available as arXiv:1903.07266 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903
[7]

Stochastic g radient descent with ﬁnite samples sizes,

K. Y uan, B. Ying, S. Vlaski, and A. H. Sayed, “Stochastic g radient descent with ﬁnite samples sizes,” in Proc. of IEEE MLSP , Vietri sul Mare, Italy, Sep. 2016, pp. 1–6

work page 2016
[8]

Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

work page 2015
[9]

Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,

K. Y uan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,” IEEE Transactions on Signal Processing , vol. 67, no. 3, pp. 724–739, Feb 2019

work page 2019
[10]

Adaptive networks,

A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE , vol. 102, no. 4, pp. 460–497, April 2014

work page 2014
[11]

R. A. Horn and C. R. Johnson, Matrix Analysis , Cambridge University Press, 2003

work page 2003
[12]

The Perron-Frobenius theorem: Some of its applications,

S. U. Pillai, T. Suel, and S. Cha, “The Perron-Frobenius theorem: Some of its applications,” IEEE Signal Processing Magazine , vol. 22, no. 2, pp. 62–75, March 2005

work page 2005
[13]

Prox imal multitask learning over networks with sparsity-inducing coregulari zation,

R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, “Prox imal multitask learning over networks with sparsity-inducing coregulari zation,” IEEE Transactions on Signal Processing , vol. 64, no. 23, pp. 6329–6344, Dec 2016

work page 2016
[14]

Adaptive penalty-based dis tributed stochastic convex optimization,

Z. J. Towﬁc and A. H. Sayed, “Adaptive penalty-based dis tributed stochastic convex optimization,” IEEE Trans. on Signal Process. , vol. 62, no. 15, pp. 3924–3938, Aug. 2014

work page 2014
[15]

Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,

B. Ying and A. H. Sayed, “Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,” Signal Processing, vol. 144, pp. 253 – 264, 2018

work page 2018
[16]

Sup- port vector machines,

M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scho lkopf, “Sup- port vector machines,” IEEE Intelligent Systems and their Applications , vol. 13, no. 4, pp. 18–28, July 1998

work page 1998
[17]

A. M. Zoubir, V . Koivunen, E. Ollila, and M. Muma, Robust Statistics for Signal Processing , Cambridge University Press, 2018

work page 2018
[18]

Dictionary learning,

I. Tosic and P . Frossard, “Dictionary learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 27–38, March 2011

work page 2011
[19]

The Loss Surfaces of Multilayer Networks,

A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y . LeCun, “The Loss Surfaces of Multilayer Networks,” in Proceedings of the Eighteenth International Conference on Artiﬁcial Inte lligence and Statistics, San Diego, May 2015, pp. 192–204

work page 2015
[20]

Recursive stochastic algori thms for global optimization in Rd,

S. Gelfand and S. Mitter, “Recursive stochastic algori thms for global optimization in Rd,” SIAM Journal on Control and Optimization , vol. 29, no. 5, pp. 999–1018, 1991

work page 1991
[21]

Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,

R. Ge, F. Huang, C. Jin, and Y . Y uan, “Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,” in Proc. of Conference on Learning Theory , Paris, France, 2015, pp. 797–842

work page 2015
[22]

Gra dient descent only converges to minimizers,

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gra dient descent only converges to minimizers,” in 29th Annual Conference on Learning Theory, New Y ork, 2016, pp. 1246–1257

work page 2016
[23]

How to escape saddle points efﬁciently,

C. Jin, R. Ge, P . Netrapalli, S. M. Kakade, and M. I. Jorda n, “How to escape saddle points efﬁciently,” in Proc. of ICML , Sydney, Australia, Aug. 2017, pp. 1724–1732

work page 2017
[24]

Escaping Saddles with Stochastic Gradients

H. Daneshmand, J. Kohler, A. Lucchi and T. Hofmann, “Esc aping saddles with stochastic gradients,” available as arXiv:1803.05999 , March 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,

C. Fang, C. J. Li, Z. Lin, and T. Zhang, “SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,” in Proc. of NIPS , pp. 689–699. Montreal, Canada, 2018

work page 2018
[26]

Natasha 2: Faster non-convex optimizat ion than SGD,

Z. Allen-Zhu, “Natasha 2: Faster non-convex optimizat ion than SGD,” in Proc. of NIPS , pp. 2675–2686. Montreal, Canada, Dec. 2018

work page 2018
[27]

NEON2: Finding local minima via ﬁrst-order oracles,

Z. Allen-Zhu and Y . Li, “NEON2: Finding local minima via ﬁrst-order oracles,” in Proc. of NIPS , pp. 3716–3726. Montreal, Canada, Dec. 2018

work page 2018
[28]

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

C. Fang, Z. Lin and T. Zhang, “Sharp analysis for nonconv ex sgd escaping from saddle points,” available as arXiv:1902.00247, Feb. 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[29]

Stochas- tic gradient descent escapes saddle points efﬁciently,

C. Jin, P . Netrapalli, R. Ge, S. M. Kakade and M. I. Jordan , “Stochas- tic gradient descent escapes saddle points efﬁciently,” available as arXiv:1902.04811, Feb. 2019

work page arXiv 1902
[30]

NEXT: in-network nonconv ex optimiza- tion,

P . Di Lorenzo and G. Scutari, “NEXT: in-network nonconv ex optimiza- tion,” IEEE Transactions on Signal and Information Processing ove r Networks, vol. 2, no. 2, pp. 120–136, June 2016

work page 2016
[31]

Global convergence of ADMM in nonconvex nonsmooth optimization,

Y . Wang, W. Yin, and J. Zeng, “Global convergence of ADMM in nonconvex nonsmooth optimization,” Journal of Scientiﬁc Computing , vol. 78, no. 1, pp. 29–63, Jan. 2019

work page 2019
[32]

Non-convex distributed opt imization,

T. Tatarenko and B. Touri, “Non-convex distributed opt imization,” IEEE Transactions on Automatic Control , vol. 62, no. 8, pp. 3744–3757, Aug. 2017

work page 2017
[33]

Second-o rder guaran- tees of distributed gradient algorithms,

A. Daneshmand, G. Scutari and V . Kungurtsev, “Second-o rder guaran- tees of distributed gradient algorithms,” available as arXiv:1809.08694 , Sep. 2018

work page arXiv 2018
[34]

Annealing for Distributed Global Optimization

B. Swenson, S. Kar, H. V . Poor and J. M. F. Moura, “Anneali ng for distributed global optimization,” available as arXiv:1903.07258 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903
[35]

Non-convex optimization for machin e learning,

P . Jain and P . Kar, “Non-convex optimization for machin e learning,” F oundations and Trends in Machine Learning, vol. 10, no. 3-4, pp. 142– 336, 2017

work page 2017
[36]

Stochastic variance reduction for nonconvex optimization,

S. J. Reddi, A. Hefny, S. Sra, B. P´ ocz´ os, and A. Smola, “ Stochastic variance reduction for nonconvex optimization,” in Proc. of ICML , New Y ork, NY , USA, 2016, pp. 314–323

work page 2016
[37]

Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization

R. Ge, Z. Li, W. Wang and X. Wang, “Stabilized SVRG: Simpl e variance reduction for nonconvex optimization,” available as arXiv:1905.00529 , May 2019. 15

work page internal anchor Pith review Pith/arXiv arXiv 1905
[38]

Cubic regularization of n ewton method and its global performance,

Y . Nesterov and B.T. Polyak, “Cubic regularization of n ewton method and its global performance,” Mathematical Programming, vol. 108, no. 1, pp. 177–205, Aug 2006

work page 2006
[39]

Klenke, Probability Theory: A Comprehensive Course , Springer, 2013

A. Klenke, Probability Theory: A Comprehensive Course , Springer, 2013

work page 2013
[40]

On the learning behavior of adap tive networks – Part II: Performance analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adap tive networks – Part II: Performance analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3518–3548, June 2015

work page 2015
[41]

Asynchronous adaptation and le arning over networks – Part II: Performance analysis,

X. Zhao and A. H. Sayed, “Asynchronous adaptation and le arning over networks – Part II: Performance analysis,” IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 827–842, Feb 2015

work page 2015

[1] [1]

Diffusion learning in non-con vex envi- ronments,

S. Vlaski and A. H. Sayed, “Diffusion learning in non-con vex envi- ronments,” in Proc. of IEEE ICASSP , Brighton, UK, May 2019, pp. 5262–5266

work page 2019

[2] [2]

Distributed learning in non-c onvex envi- ronments – Part II: Polynomial escape from saddle-points,

S. Vlaski and A. H. Sayed, “Distributed learning in non-c onvex envi- ronments – Part II: Polynomial escape from saddle-points,” submitted for publication, see also arXiv version , July 2019

work page 2019

[3] [3]

Distributed subgradient meth ods for multi- agent optimization,

A. Nedic and A. Ozdaglar, “Distributed subgradient meth ods for multi- agent optimization,” IEEE Trans. Automatic Control , vol. 54, no. 1, pp. 48–61, Jan 2009

work page 2009

[4] [4]

Adaptation, learning, and optimization ov er networks,

A. H. Sayed, “Adaptation, learning, and optimization ov er networks,” F oundations and Trends in Machine Learning , vol. 7, no. 4-5, pp. 311– 801, July 2014

work page 2014

[5] [5]

On the learning behavior of adapt ive networks - Part I: Transient analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adapt ive networks - Part I: Transient analysis,” IEEE Transactions on Information Theory , vol. 61, no. 6, pp. 3487–3517, June 2015

work page 2015

[6] [6]

Distributed stochastic optimization with gradient tracking over strongly-connected networks

R. Xin, A. K. Sahu, U. A. Khan, S. Kar, “Distributed stocha stic optimization with gradient tracking over strongly-connec ted networks,” available as arXiv:1903.07266 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903

[7] [7]

Stochastic g radient descent with ﬁnite samples sizes,

K. Y uan, B. Ying, S. Vlaski, and A. H. Sayed, “Stochastic g radient descent with ﬁnite samples sizes,” in Proc. of IEEE MLSP , Vietri sul Mare, Italy, Sep. 2016, pp. 1–6

work page 2016

[8] [8]

Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact ﬁrst-or der algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

work page 2015

[9] [9]

Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,

K. Y uan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusio n for distributed optimization and learning – Part II: Convergen ce analysis,” IEEE Transactions on Signal Processing , vol. 67, no. 3, pp. 724–739, Feb 2019

work page 2019

[10] [10]

Adaptive networks,

A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE , vol. 102, no. 4, pp. 460–497, April 2014

work page 2014

[11] [11]

R. A. Horn and C. R. Johnson, Matrix Analysis , Cambridge University Press, 2003

work page 2003

[12] [12]

The Perron-Frobenius theorem: Some of its applications,

S. U. Pillai, T. Suel, and S. Cha, “The Perron-Frobenius theorem: Some of its applications,” IEEE Signal Processing Magazine , vol. 22, no. 2, pp. 62–75, March 2005

work page 2005

[13] [13]

Prox imal multitask learning over networks with sparsity-inducing coregulari zation,

R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, “Prox imal multitask learning over networks with sparsity-inducing coregulari zation,” IEEE Transactions on Signal Processing , vol. 64, no. 23, pp. 6329–6344, Dec 2016

work page 2016

[14] [14]

Adaptive penalty-based dis tributed stochastic convex optimization,

Z. J. Towﬁc and A. H. Sayed, “Adaptive penalty-based dis tributed stochastic convex optimization,” IEEE Trans. on Signal Process. , vol. 62, no. 15, pp. 3924–3938, Aug. 2014

work page 2014

[15] [15]

Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,

B. Ying and A. H. Sayed, “Performance limits of stochast ic sub-gradient learning, Part II: Multi-agent case,” Signal Processing, vol. 144, pp. 253 – 264, 2018

work page 2018

[16] [16]

Sup- port vector machines,

M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scho lkopf, “Sup- port vector machines,” IEEE Intelligent Systems and their Applications , vol. 13, no. 4, pp. 18–28, July 1998

work page 1998

[17] [17]

A. M. Zoubir, V . Koivunen, E. Ollila, and M. Muma, Robust Statistics for Signal Processing , Cambridge University Press, 2018

work page 2018

[18] [18]

Dictionary learning,

I. Tosic and P . Frossard, “Dictionary learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 27–38, March 2011

work page 2011

[19] [19]

The Loss Surfaces of Multilayer Networks,

A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y . LeCun, “The Loss Surfaces of Multilayer Networks,” in Proceedings of the Eighteenth International Conference on Artiﬁcial Inte lligence and Statistics, San Diego, May 2015, pp. 192–204

work page 2015

[20] [20]

Recursive stochastic algori thms for global optimization in Rd,

S. Gelfand and S. Mitter, “Recursive stochastic algori thms for global optimization in Rd,” SIAM Journal on Control and Optimization , vol. 29, no. 5, pp. 999–1018, 1991

work page 1991

[21] [21]

Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,

R. Ge, F. Huang, C. Jin, and Y . Y uan, “Escaping from saddl e pointsonline stochastic gradient for tensor decompositio n,” in Proc. of Conference on Learning Theory , Paris, France, 2015, pp. 797–842

work page 2015

[22] [22]

Gra dient descent only converges to minimizers,

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gra dient descent only converges to minimizers,” in 29th Annual Conference on Learning Theory, New Y ork, 2016, pp. 1246–1257

work page 2016

[23] [23]

How to escape saddle points efﬁciently,

C. Jin, R. Ge, P . Netrapalli, S. M. Kakade, and M. I. Jorda n, “How to escape saddle points efﬁciently,” in Proc. of ICML , Sydney, Australia, Aug. 2017, pp. 1724–1732

work page 2017

[24] [24]

Escaping Saddles with Stochastic Gradients

H. Daneshmand, J. Kohler, A. Lucchi and T. Hofmann, “Esc aping saddles with stochastic gradients,” available as arXiv:1803.05999 , March 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[25] [25]

SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,

C. Fang, C. J. Li, Z. Lin, and T. Zhang, “SPIDER: Near-opt imal non- convex optimization via stochastic path-integrated differential estimator,” in Proc. of NIPS , pp. 689–699. Montreal, Canada, 2018

work page 2018

[26] [26]

Natasha 2: Faster non-convex optimizat ion than SGD,

Z. Allen-Zhu, “Natasha 2: Faster non-convex optimizat ion than SGD,” in Proc. of NIPS , pp. 2675–2686. Montreal, Canada, Dec. 2018

work page 2018

[27] [27]

NEON2: Finding local minima via ﬁrst-order oracles,

Z. Allen-Zhu and Y . Li, “NEON2: Finding local minima via ﬁrst-order oracles,” in Proc. of NIPS , pp. 3716–3726. Montreal, Canada, Dec. 2018

work page 2018

[28] [28]

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

C. Fang, Z. Lin and T. Zhang, “Sharp analysis for nonconv ex sgd escaping from saddle points,” available as arXiv:1902.00247, Feb. 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[29] [29]

Stochas- tic gradient descent escapes saddle points efﬁciently,

C. Jin, P . Netrapalli, R. Ge, S. M. Kakade and M. I. Jordan , “Stochas- tic gradient descent escapes saddle points efﬁciently,” available as arXiv:1902.04811, Feb. 2019

work page arXiv 1902

[30] [30]

NEXT: in-network nonconv ex optimiza- tion,

P . Di Lorenzo and G. Scutari, “NEXT: in-network nonconv ex optimiza- tion,” IEEE Transactions on Signal and Information Processing ove r Networks, vol. 2, no. 2, pp. 120–136, June 2016

work page 2016

[31] [31]

Global convergence of ADMM in nonconvex nonsmooth optimization,

Y . Wang, W. Yin, and J. Zeng, “Global convergence of ADMM in nonconvex nonsmooth optimization,” Journal of Scientiﬁc Computing , vol. 78, no. 1, pp. 29–63, Jan. 2019

work page 2019

[32] [32]

Non-convex distributed opt imization,

T. Tatarenko and B. Touri, “Non-convex distributed opt imization,” IEEE Transactions on Automatic Control , vol. 62, no. 8, pp. 3744–3757, Aug. 2017

work page 2017

[33] [33]

Second-o rder guaran- tees of distributed gradient algorithms,

A. Daneshmand, G. Scutari and V . Kungurtsev, “Second-o rder guaran- tees of distributed gradient algorithms,” available as arXiv:1809.08694 , Sep. 2018

work page arXiv 2018

[34] [34]

Annealing for Distributed Global Optimization

B. Swenson, S. Kar, H. V . Poor and J. M. F. Moura, “Anneali ng for distributed global optimization,” available as arXiv:1903.07258 , March 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903

[35] [35]

Non-convex optimization for machin e learning,

P . Jain and P . Kar, “Non-convex optimization for machin e learning,” F oundations and Trends in Machine Learning, vol. 10, no. 3-4, pp. 142– 336, 2017

work page 2017

[36] [36]

Stochastic variance reduction for nonconvex optimization,

S. J. Reddi, A. Hefny, S. Sra, B. P´ ocz´ os, and A. Smola, “ Stochastic variance reduction for nonconvex optimization,” in Proc. of ICML , New Y ork, NY , USA, 2016, pp. 314–323

work page 2016

[37] [37]

Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization

R. Ge, Z. Li, W. Wang and X. Wang, “Stabilized SVRG: Simpl e variance reduction for nonconvex optimization,” available as arXiv:1905.00529 , May 2019. 15

work page internal anchor Pith review Pith/arXiv arXiv 1905

[38] [38]

Cubic regularization of n ewton method and its global performance,

Y . Nesterov and B.T. Polyak, “Cubic regularization of n ewton method and its global performance,” Mathematical Programming, vol. 108, no. 1, pp. 177–205, Aug 2006

work page 2006

[39] [39]

Klenke, Probability Theory: A Comprehensive Course , Springer, 2013

A. Klenke, Probability Theory: A Comprehensive Course , Springer, 2013

work page 2013

[40] [40]

On the learning behavior of adap tive networks – Part II: Performance analysis,

J. Chen and A. H. Sayed, “On the learning behavior of adap tive networks – Part II: Performance analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3518–3548, June 2015

work page 2015

[41] [41]

Asynchronous adaptation and le arning over networks – Part II: Performance analysis,

X. Zhao and A. H. Sayed, “Asynchronous adaptation and le arning over networks – Part II: Performance analysis,” IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 827–842, Feb 2015

work page 2015