No bad local minima: Data independent training error guarantees for multilayer neural networks

Daniel Soudry, Yair Carmon · 2016 · stat.ML · arXiv 1605.08361

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing about the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.
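
The setting described in the abstract can be illustrated with a small numerical sketch. This is not the authors' experiment: the function name `train_one_hidden_layer`, the leaky ReLU slope, the layer sizes, the learning rate, and the Gaussian random data are all assumptions chosen for illustration, and the dropout-like noise from the paper is omitted. The sketch trains a one-hidden-layer network with a piecewise linear activation, a single output, and quadratic loss by plain gradient descent, with more first-layer weights than training samples as a stand-in for mild over-parametrization.

```python
import numpy as np


def train_one_hidden_layer(n_samples=20, d_in=10, d_hidden=30,
                           steps=5000, lr=0.01, slope=0.1, seed=0):
    """Gradient descent on a one-hidden-layer leaky ReLU net with quadratic loss.

    All sizes and hyperparameters are illustrative assumptions, not values
    taken from the paper. Returns (first recorded loss, last recorded loss).
    """
    rng = np.random.default_rng(seed)
    # Arbitrary data: random inputs and random targets (no structure assumed).
    X = rng.standard_normal((n_samples, d_in))
    y = rng.standard_normal(n_samples)
    # First-layer weights W (d_hidden x d_in) and output weights v (d_hidden,).
    W = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
    v = rng.standard_normal(d_hidden) / np.sqrt(d_hidden)

    losses = []
    for _ in range(steps):
        pre = X @ W.T                         # pre-activations, (n_samples, d_hidden)
        gate = np.where(pre > 0, 1.0, slope)  # piecewise linear activation slopes
        hid = gate * pre                      # leaky ReLU output
        err = hid @ v - y                     # residual of the single output
        losses.append(0.5 * np.mean(err ** 2))

        # Gradients of the mean quadratic loss with respect to v and W.
        grad_v = hid.T @ err / n_samples
        grad_W = (err[:, None] * gate * v[None, :]).T @ X / n_samples
        v -= lr * grad_v
        W -= lr * grad_W
    return losses[0], losses[-1]


if __name__ == "__main__":
    first, last = train_one_hidden_layer()
    print(f"initial loss {first:.4f}, final loss {last:.6f}")
```

Comparing the two returned values shows how far local updates drive the training loss on unstructured data. A finite run of this kind only illustrates the claim; it does not verify that a local minimum was reached.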

representative citing papers

Flat Channels to Infinity in Neural Loss Landscapes

cs.LG · 2025-06-17 · unverdicted · novelty 7.0

Neural loss landscapes contain flat channels to infinity along which gradient flow leads pairs of neurons to implement gated linear units.

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

cs.LG · 2019-07-05 · conditional · novelty 7.0

Permutation symmetries generate permutation saddles and equal-loss valleys linking equivalent global minima, yielding a lower bound on symmetry-induced critical points.

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

cs.LG · 2024-01-02 · unverdicted · novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.

Robust and Resource Efficient Identification of Two Hidden Layer Neural Networks

cs.LG · 2019-06-30 · unverdicted · novelty 6.0

Presents an active-sampling method that approximates the weight subspace from Hessian finite differences, recovers the rank-1 tensors by robust nonlinear programming, and attributes layers with gradient descent, yielding stable recovery under a-posteriori verifiable conditions.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

cs.LG · 2016-09-15 · unverdicted · novelty 6.0

Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.

citing papers explorer

Showing 5 of 5 citing papers.

Flat Channels to Infinity in Neural Loss Landscapes

cs.LG · 2025-06-17 · unverdicted · none · ref 16 · internal anchor

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

cs.LG · 2019-07-05 · conditional · none · ref 19 · internal anchor

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

cs.LG · 2024-01-02 · unverdicted · none · ref 111 · internal anchor

Robust and Resource Efficient Identification of Two Hidden Layer Neural Networks

cs.LG · 2019-06-30 · unverdicted · none · ref 57 · internal anchor

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

cs.LG · 2016-09-15 · unverdicted · none · ref 13
