Qualitatively characterizing neural network optimization problems

Andrew M. Saxe; Ian J. Goodfellow; Oriol Vinyals

arxiv: 1412.6544 · v6 · pith:QBUMYC6Cnew · submitted 2014-12-19 · 💻 cs.NE · cs.LG · stat.ML

Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe

classification 💻 cs.NE · cs.LG · stat.ML

keywords networks · neural · optimization · training · local · obstacles · problems · variety

Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
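The abstract describes the technique only as looking along "a straight path from initialization to solution". One plausible reading, sketched below, is to evaluate the training loss at evenly spaced points on the line segment between the initial and final parameter vectors. Everything in the sketch is an illustrative assumption rather than the authors' code: the helper name `loss_along_line`, the tiny tanh network, the synthetic data, and the plain gradient-descent loop that stands in for SGD training.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment
    theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    )
    return alphas, losses

# Hypothetical toy problem: one-hidden-layer tanh network on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = np.sin(X @ np.array([1.0, -2.0, 0.5]))
H = 4  # number of hidden units (assumed)

def unpack(theta):
    """Split the flat parameter vector into hidden and output weights."""
    W1 = theta[: 3 * H].reshape(3, H)
    w2 = theta[3 * H:]
    return W1, w2

def mse(theta):
    """Mean squared training error at parameters theta."""
    W1, w2 = unpack(theta)
    return float(np.mean((np.tanh(X @ W1) @ w2 - y) ** 2))

def mse_grad(theta):
    """Analytic gradient of mse with respect to theta."""
    W1, w2 = unpack(theta)
    A = np.tanh(X @ W1)                 # hidden activations, shape (N, H)
    r = A @ w2 - y                      # residuals, shape (N,)
    n = X.shape[0]
    g_w2 = (2.0 / n) * (A.T @ r)
    dZ = (2.0 / n) * r[:, None] * w2[None, :] * (1.0 - A ** 2)
    g_W1 = X.T @ dZ
    return np.concatenate([g_W1.ravel(), g_w2])

# Full-batch gradient descent as a stand-in for the training run.
theta_init = 0.1 * rng.normal(size=3 * H + H)
theta = theta_init.copy()
for _ in range(500):
    theta = theta - 0.1 * mse_grad(theta)
theta_final = theta

alphas, losses = loss_along_line(mse, theta_init, theta_final)
for a, l in zip(alphas, losses):
    print(f"alpha={a:.1f}  loss={l:.4f}")
```

Under this reading, a bump in the printed losses between `alpha=0` and `alpha=1` would indicate an obstacle on the straight path, while a smooth decrease would match the behaviour the abstract reports for the networks it studied. The toy problem makes no claim about what the curve looks like for any real network.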

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A ghost mechanism: An analytical model of abrupt learning in recurrent networks
cs.LG 2025-01 unverdicted novelty 7.0

The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.

In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape
cs.LG 2019-07 conditional novelty 7.0

Permutation symmetries generate permutation saddles and equal-loss valleys linking equivalent global minima, yielding a lower bound on symmetry-induced critical points.

From Attribution to Action: A Human-Centered Application of Activation Steering
cs.AI 2026-04 unverdicted novelty 6.0

Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...

X-SYS: A Reference Architecture for Interactive Explanation Systems
cs.AI 2026-02 unverdicted novelty 6.0

X-SYS is a reference architecture for interactive explanation systems organized around STAR quality attributes and five service components, demonstrated via SemanticLens for vision-language models.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Federated Learning with Non-IID Data
cs.LG 2018-06 conditional novelty 6.0

Non-IID data causes up to 55% accuracy loss in federated learning due to weight divergence measured by earth mover's distance; 5% globally shared data recovers 30% accuracy on CIFAR-10.

Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks
cs.LG 2026-04 unverdicted novelty 5.0

A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.

The Platonic Representation Hypothesis
cs.LG 2024-05 unverdicted novelty 5.0

Representations learned by large AI models are converging toward a shared statistical model of reality.