Deep Learning without Poor Local Minima

· 2016 · stat.ML · arXiv 1605.07110

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.

representative citing papers

Absence of poor local minima in matrix product states

quant-ph · 2026-06-08 · unverdicted · novelty 7.0

MPS energy landscapes lack poor local minima because gauge freedom induces overparametrization that concentrates local minima near the global minimum, with the local minimum distribution proven invariant under orthogonality center moves.

Statistical Properties of Training & Generalization

stat.ML · 2026-06-18 · unverdicted · novelty 1.0

Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Statistical Properties of Training & Generalization stat.ML · 2026-06-18 · unverdicted · none · ref 57 · internal anchor
Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.

Deep Learning without Poor Local Minima

fields

years

verdicts

representative citing papers

citing papers explorer