On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang · 2016 · cs.LG · arXiv 1609.04836

Canonical reference. 42 Pith papers cite this work; 100% of the classified citations use it as background.

abstract

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

citation-role summary

background 8

citation-polarity summary

background 8

representative citing papers

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

cs.LG · 2022-01-06 · unverdicted · novelty 8.0

Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.

Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new SOTA on DomainBed.

Characterizing and Correcting Effective Target Shift in Online Learning

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.

ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

ConquerNet smooths quantile ReLU networks with convolution for easier training and establishes minimax-optimal nonasymptotic risk bounds over Besov function classes.

Estimating Implicit Regularization in Deep Learning

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.

When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

cs.LG · 2026-04-16 · conditional · novelty 7.0

FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces

math.OC · 2026-04-12 · unverdicted · novelty 7.0

SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.

How does the optimizer implicitly bias the model merging loss landscape?

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

stat.ML · 2019-06-21 · unverdicted · novelty 7.0

Derives explicit step-size conditions ensuring the metastability behavior of discrete SGD under heavy-tailed noise approximates its continuous SDE limit.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

Feature Starvation as Geometric Instability in Sparse Autoencoders

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

Generalization at the Edge of Stability

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.

Lorentz Framework for Semantic Segmentation

cs.CV · 2026-04-18 · unverdicted · novelty 6.0

A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3, SegFormer, Mask2Former and MaskFormer.

Robust Policy Optimization to Prevent Catastrophic Forgetting

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

cs.LG · 2025-02-08 · unverdicted · novelty 6.0

TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.

Optimization Hyper-parameter Laws for Large Language Models

cs.LG · 2024-09-07 · unverdicted · novelty 6.0

Opt-Laws predicts LLM final training loss from LR schedules via SDE-derived convergence and escape features, with 94% Top-2 hit rate on held-out schedules and F1=0.92 for divergence detection.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

cs.LG · 2023-09-25 · accept · novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Sharpness-Aware Minimization for Efficiently Improving Generalization

cs.LG · 2020-10-03 · conditional · novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.

citing papers explorer

Showing 42 of 42 citing papers.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
cs.LG · 2022-01-06 · unverdicted · none · ref 9 · internal anchor
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
cs.LG · 2026-05-11 · unverdicted · none · ref 34 · internal anchor
LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.

Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
cs.LG · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new SOTA on DomainBed.

Characterizing and Correcting Effective Target Shift in Online Learning
stat.ML · 2026-05-08 · unverdicted · none · ref 62 · internal anchor
Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.

ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees
stat.ML · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
ConquerNet smooths quantile ReLU networks with convolution for easier training and establishes minimax-optimal nonasymptotic risk bounds over Besov function classes.

Estimating Implicit Regularization in Deep Learning
stat.ML · 2026-05-06 · unverdicted · none · ref 20 · internal anchor
Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.

When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
cs.LG · 2026-04-16 · conditional · none · ref 9 · internal anchor
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces
math.OC · 2026-04-12 · unverdicted · none · ref 11 · internal anchor
SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.

How does the optimizer implicitly bias the model merging loss landscape?
cs.LG · 2025-10-06 · unverdicted · none · ref 6 · internal anchor
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example
cs.LG · 2025-04-29 · accept · none · ref 66 · internal anchor
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
stat.ML · 2019-06-21 · unverdicted · none · ref 9 · internal anchor
Derives explicit step-size conditions ensuring the metastability behavior of discrete SGD under heavy-tailed noise approximates its continuous SDE limit.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
stat.ML · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

Feature Starvation as Geometric Instability in Sparse Autoencoders
cs.LG · 2026-05-06 · unverdicted · none · ref 21 · internal anchor
Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
cs.LG · 2026-05-04 · unverdicted · none · ref 75 · internal anchor
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

Generalization at the Edge of Stability
cs.LG · 2026-04-21 · unverdicted · none · ref 44 · internal anchor
Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.

Lorentz Framework for Semantic Segmentation
cs.CV · 2026-04-18 · unverdicted · none · ref 29 · internal anchor
A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3, SegFormer, Mask2Former and MaskFormer.

Robust Policy Optimization to Prevent Catastrophic Forgetting
cs.LG · 2026-02-09 · unverdicted · none · ref 28 · internal anchor
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
cs.LG · 2025-02-08 · unverdicted · none · ref 261 · internal anchor
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.

Optimization Hyper-parameter Laws for Large Language Models
cs.LG · 2024-09-07 · unverdicted · none · ref 24 · internal anchor
Opt-Laws predicts LLM final training loss from LR schedules via SDE-derived convergence and escape features, with 94% Top-2 hit rate on held-out schedules and F1=0.92 for divergence detection.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG · 2023-09-25 · accept · none · ref 56 · internal anchor
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

Language Models (Mostly) Know What They Know
cs.CL · 2022-07-11 · unverdicted · none · ref 278 · internal anchor
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL · 2021-12-01 · conditional · none · ref 200 · internal anchor
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer
cs.LG · 2021-02-02 · unverdicted · none · ref 158 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Sharpness-Aware Minimization for Efficiently Improving Generalization
cs.LG · 2020-10-03 · conditional · none · ref 39 · internal anchor
SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
cs.LG · 2019-04-01 · conditional · none · ref 10 · internal anchor
LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.

Improving Generalization by Permutation Routing Across Model Copies
cs.LG · 2026-05-10 · unverdicted · none · ref 6 · internal anchor
Replicating models and routing their local losses via permutations from a mixing kernel Q enables structured message sharing that improves generalization.

From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
cs.LG · 2026-05-01 · unverdicted · none · ref 1 · internal anchor
EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.

Sampling Parallelism for Fast and Efficient Bayesian Learning
cs.LG · 2026-04-06 · unverdicted · none · ref 22 · internal anchor
Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.

Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
eess.IV · 2026-04-02 · unverdicted · none · ref 37 · internal anchor
MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.

Spectral methods: crucial for machine learning, natural for quantum computers?
quant-ph · 2026-03-25 · unverdicted · none · ref 71 · internal anchor
Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.

Intelligence Inertia: Physical Isomorphism and Applications
cs.AI · 2026-03-22 · unverdicted · none · ref 54 · internal anchor
Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.

Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity
cs.LG · 2025-11-18 · unverdicted · none · ref 31 · internal anchor
WCR is a new training regularizer that concentrates weight magnitudes onto few parameters to improve one-shot pruning robustness under aggressive sparsity.

The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks
quant-ph · 2025-04-27 · unverdicted · none · ref 29 · internal anchor
Increasing the number of local patches in a distributed quantum neural network architecture reduces the largest Hessian eigenvalue at minima and introduces a class-dependent outlier structure in the eigenspectrum.

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD
cs.LG · 2019-06-26 · unverdicted · none · ref 13 · internal anchor
GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.

Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
cs.LG · 2026-05-08 · unverdicted · none · ref 50 · internal anchor
The paper proposes Trajectory Regularized Merging (TRM) to enable storage-free model merging in continual learning by optimizing in an augmented trajectory subspace with task alignment, prediction consistency, and gradient responsiveness objectives, claiming SOTA results.

A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots
cs.RO · 2026-04-21 · unverdicted · none · ref 27 · internal anchor
OpenCLIP-based gesture classification with linear probing controls AcoustoBot swarms at 87.8% accuracy and 3.95 s latency in controlled tests.

Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems
cs.LG · 2019-07-16 · unverdicted · none · ref 26 · internal anchor
Experiments show that shifted-ReLU layers can replace batch-normalization in single-bit-weight wide residual networks on CIFAR-10/100 and ImageNet without consistent accuracy penalty.

On improving deep learning generalization with adaptive sparse connectivity
cs.NE · 2019-06-27 · unverdicted · none · ref 5 · internal anchor
Sparse MLPs trained via SET plus neuron pruning achieve competitive performance on 15 datasets while pruning ~50% of hidden neurons and keeping parameter count linear in neuron count.

Is It Worth the Attention? A Comparative Evaluation of Attention Layers for Argument Unit Segmentation
cs.CL · 2019-06-24 · unverdicted · none · ref 15 · internal anchor
Attention layers do not improve BiLSTM performance on argument unit segmentation and contextualized embeddings show little benefit.

There Will Be a Scientific Theory of Deep Learning
stat.ML · 2026-04-23 · unverdicted · none · ref 111 · internal anchor
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

An overview of condensation phenomenon in deep learning
cs.LG · 2025-04-13 · unverdicted · none · ref 10 · internal anchor
Neural networks exhibit condensation of neurons into clusters with similar outputs whose number increases monotonically during training, facilitated by small initializations or dropout, providing insights into generalization and reasoning.

A Sharper Picture of Generalization in Transformers
cs.LG · 2026-05-20 · unreviewed · ref 15 · internal anchor
