Exponentially vanishing sub-optimal local minima in multilayer neural networks

Daniel Soudry; Elad Hoffer

arxiv: 1702.05777 · v5 · pith:4SXXL6IAnew · submitted 2017-02-19 · 📊 stat.ML

Exponentially vanishing sub-optimal local minima in multilayer neural networks

Daniel Soudry , Elad Hoffer This is my paper

classification 📊 stat.ML

keywords hiddenminimaerrorexponentiallyhighleftlocalomega

0 comments

read the original abstract

Background: Statistical mechanics results (Dauphin et al. (2014); Choromanska et al. (2015)) suggest that local minima with high error are exponentially rare in high dimensions. However, to prove low error guarantees for Multilayer Neural Networks (MNNs), previous works so far required either a heavily modified MNN model or training method, strong assumptions on the labels (e.g., "near" linear separability), or an unrealistic hidden layer with $\Omega\left(N\right)$ units. Results: We examine a MNN with one hidden layer of piecewise linear units, a single output, and a quadratic loss. We prove that, with high probability in the limit of $N\rightarrow\infty$ datapoints, the volume of differentiable regions of the empiric loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the same volume of global minima, given standard normal input of dimension $d_{0}=\tilde{\Omega}\left(\sqrt{N}\right)$, and a more realistic number of $d_{1}=\tilde{\Omega}\left(N/d_{0}\right)$ hidden units. We demonstrate our results numerically: for example, $0\%$ binary classification training error on CIFAR with only $N/d_{0}\approx 16$ hidden neurons.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Exploring Model-based Planning with Policy Networks
cs.LG 2019-06 unverdicted novelty 7.0

POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
cs.LG 2024-01 unverdicted novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...