On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Canonical reference. 100% of citing Pith papers cite this work as background.

56 Pith papers citing it
Background 100% of classified citations
abstract

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
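The abstract attributes small-batch behavior to the inherent noise in the gradient estimate. As a rough illustration of that noise alone, the sketch below measures how far a mini-batch gradient strays from the full-batch gradient at several batch sizes. It is a toy least-squares problem on synthetic data, not the paper's experimental setup, and it says nothing about sharpness or generalization; the function name `minibatch_gradient_noise` and all problem sizes are assumptions made for this example.

```python
import numpy as np

def minibatch_gradient_noise(batch_size, n_trials=200, seed=0):
    """Mean squared distance between mini-batch and full-batch gradients.

    Hypothetical helper for illustration: a synthetic least-squares problem,
    evaluated at the fixed point w = 0.
    """
    rng = np.random.default_rng(seed)
    n, d = 8192, 10                      # assumed dataset size and dimension
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.5 * rng.normal(size=n)
    w = np.zeros(d)                      # point at which gradients are compared

    def grad(idx):
        # Gradient of 0.5 * mean((X w - y)^2) over the rows in idx.
        residual = X[idx] @ w - y[idx]
        return X[idx].T @ residual / len(idx)

    full = grad(np.arange(n))
    sq_errs = []
    for _ in range(n_trials):
        # Sample a mini-batch without replacement, as in one SGD step.
        idx = rng.choice(n, size=batch_size, replace=False)
        sq_errs.append(np.sum((grad(idx) - full) ** 2))
    return float(np.mean(sq_errs))

noise = {b: minibatch_gradient_noise(b) for b in (32, 512, 4096)}
for b, v in noise.items():
    print(f"batch={b:5d}  mean squared gradient error={v:.4f}")
```

In this toy setting the measured error shrinks as the batch grows, roughly in proportion to the inverse of the batch size, which is the sense in which large-batch gradients are less noisy.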

hub tools

citation-role summary

background 8

representative citing papers

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

A Sharper Picture of Generalization in Transformers

cs.LG · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

PAC-Bayes applied to low-sharpness flat minima yields non-vacuous generalization bounds for boolean functions whose Fourier spectra are sparse and low-degree, with parameters estimable by property testing.

Feature Starvation as Geometric Instability in Sparse Autoencoders

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.
