Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun · 2016 · cs.LG · arXiv 1611.01838

11 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
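
The nested-loop structure described in the abstract can be illustrated with a short sketch. It assumes the local-entropy formulation in which the inner loop runs Langevin dynamics on f(x') + (γ/2)‖x − x'‖² and the outer loop steps along γ(x − μ), where μ is a running average of the inner iterates. The function name `entropy_sgd_step`, the toy quadratic loss, and all hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def entropy_sgd_step(x, grad_f, rng, gamma=1.0, lr=0.1, inner_lr=0.1,
                     inner_steps=20, noise_scale=1e-3, alpha=0.75):
    """One outer update: estimate the local-entropy gradient with an inner
    Langevin loop, then take a gradient step on the weights.
    All default values here are assumed for illustration."""
    x_prime = x.copy()  # inner-loop iterate, started at the current weights
    mu = x.copy()       # running average of the inner iterates
    for _ in range(inner_steps):
        # Gradient of f(x') + (gamma / 2) * ||x - x'||^2 with respect to x'
        g = grad_f(x_prime) - gamma * (x - x_prime)
        noise = rng.standard_normal(x.shape)
        # Langevin step: gradient descent plus scaled Gaussian noise
        x_prime = x_prime - inner_lr * g + np.sqrt(inner_lr) * noise_scale * noise
        # Exponential moving average of the inner iterates
        mu = (1.0 - alpha) * mu + alpha * x_prime
    # Outer step along the estimated negative local-entropy gradient
    return x - lr * gamma * (x - mu)

# Toy usage (assumed example): f(w) = 0.5 * ||w||^2, so grad_f(w) = w.
rng = np.random.default_rng(0)
x = np.array([2.0, -1.0])
for _ in range(200):
    x = entropy_sgd_step(x, lambda w: w, rng)
```

On this convex toy problem the iterate simply contracts toward the minimum at the origin. The bias toward wide valleys only matters on non-convex landscapes, where the proximal term γ controls how large a neighborhood the inner loop explores.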

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

stat.ML · 2019-06-21 · unverdicted · novelty 7.0

Derives explicit step-size conditions ensuring the metastability behavior of discrete SGD under heavy-tailed noise approximates its continuous SDE limit.

Estimating Implicit Regularization in Deep Learning

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.

Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.

Sharpness-Aware Minimization for Efficiently Improving Generalization

cs.LG · 2020-10-03 · conditional · novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.

Heavy-ball Algorithms Always Escape Saddle Points

math.OC · 2019-07-23 · unverdicted · novelty 6.0

Heavy-ball methods with random starts provably escape saddle points via a new state-space mapping that allows larger steps than plain gradient descent.

Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets

cs.LG · 2019-06-26 · unverdicted · novelty 6.0

Derives algorithm-dependent generalization bounds for neural nets using multilevel entropic regularization and proposes a Metropolis-simulated multi-scale Gibbs training procedure tested on a two-layer net for MNIST.

A Stochastic Composite Gradient Method with Incremental Variance Reduction

math.OC · 2019-06-24 · unverdicted · novelty 6.0

Proposes an incremental variance-reduced stochastic gradient method for minimizing smooth nonconvex composite functions that achieves optimal first-order complexity rates.

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

cs.LG · 2016-09-15 · unverdicted · novelty 6.0

Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

cs.LG · 2019-06-26 · unverdicted · novelty 5.0

GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

cs.LG · 2019-07-24 · unverdicted · novelty 4.0

Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.

citing papers explorer

Showing 11 of 11 citing papers.

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise · stat.ML · 2019-06-21 · unverdicted · none · ref 4 · internal anchor
Derives explicit step-size conditions ensuring the metastability behavior of discrete SGD under heavy-tailed noise approximates its continuous SDE limit.
Estimating Implicit Regularization in Deep Learning · stat.ML · 2026-05-06 · unverdicted · none · ref 11
Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing · cs.LG · 2026-05-15 · unverdicted · none · ref 38 · internal anchor
Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.
Sharpness-Aware Minimization for Efficiently Improving Generalization · cs.LG · 2020-10-03 · conditional · none · ref 2 · internal anchor
SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
Heavy-ball Algorithms Always Escape Saddle Points · math.OC · 2019-07-23 · unverdicted · none · ref 6 · internal anchor
Heavy-ball methods with random starts provably escape saddle points via a new state-space mapping that allows larger steps than plain gradient descent.
Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets · cs.LG · 2019-06-26 · unverdicted · none · ref 28 · internal anchor
Derives algorithm-dependent generalization bounds for neural nets using multilevel entropic regularization and proposes a Metropolis-simulated multi-scale Gibbs training procedure tested on a two-layer net for MNIST.
A Stochastic Composite Gradient Method with Incremental Variance Reduction · math.OC · 2019-06-24 · unverdicted · none · ref 5 · internal anchor
Proposes an incremental variance-reduced stochastic gradient method for minimizing smooth nonconvex composite functions that achieves optimal first-order complexity rates.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting · cs.LG · 2026-05-04 · unverdicted · none · ref 31
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima · cs.LG · 2016-09-15 · unverdicted · none · ref 2
Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD · cs.LG · 2019-06-26 · unverdicted · none · ref 4 · internal anchor
GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.
Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization · cs.LG · 2019-07-24 · unverdicted · none · ref 10 · internal anchor
Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.
