Recognition: 2 theorem links · Lean Theorem
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Pith reviewed 2026-05-13 10:54 UTC · model grok-4.3
The pith
Large-batch SGD converges to sharp minima that generalize worse than the flat minima reached by small-batch methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large-batch methods tend to converge to sharp minimizers of the training and testing functions, which lead to poorer generalization, whereas small-batch methods consistently converge to flat minimizers due to the inherent noise in the gradient estimation.
What carries the argument
Sharpness of the loss minima, measured by the curvature or width of the basin, which correlates directly with generalization performance.
If this is right
- Large-batch training can close the generalization gap if techniques are applied to favor flatter minima.
- The noise from small-batch gradients functions as an implicit regularizer that promotes flatter solutions.
- Hyperparameter adjustments such as learning-rate scaling can partially mitigate the sharpness issue in large-batch regimes.
- Parallel hardware speedups from large batches become usable in practice only after the sharpness-related gap is addressed.
Where Pith is reading between the lines
- If sharpness is causal, then any method that penalizes high curvature could be combined with large batches to improve generalization without reducing batch size.
- The result implies that optimal batch size may depend on the geometry of the particular loss surface rather than solely on hardware efficiency.
- Controlled experiments adding synthetic noise to large-batch gradients could test whether the flat-minima benefit can be reproduced directly (a minimal sketch of such an experiment follows this list).
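The last item is concrete enough to sketch. Below is a minimal illustration, assuming a PyTorch model, loss function, and optimizer (all names here are placeholders, and the noise-scaling rule is a heuristic rather than anything prescribed by the paper), of injecting synthetic Gaussian noise into large-batch gradients so that part of the variance lost by enlarging the batch is restored:

```python
import torch


def noisy_large_batch_step(model, loss_fn, inputs, targets, optimizer,
                           large_bs, small_bs, base_noise_std=1e-3):
    """One large-batch step with synthetic Gaussian gradient noise.

    Heuristic: SGD gradient noise variance scales roughly as 1/batch_size,
    so the added noise scale grows with the batch-size ratio. This is an
    illustrative protocol, not the paper's.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Restore some of the "missing" small-batch gradient variance.
    scale = base_noise_std * max(large_bs / small_bs - 1.0, 0.0) ** 0.5
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad) * scale)

    optimizer.step()
    return loss.item()
```

If the flat-minima benefit is really driven by gradient noise, sweeping base_noise_std in a run like this should recover part of the small-batch generalization behavior.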
Load-bearing premise
The observed difference in sharpness between large-batch and small-batch minima is the primary driver of the generalization gap rather than a side effect of other training factors.
What would settle it
Train a network with large batches while explicitly encouraging a flat minimum through added regularization or modified loss, then measure whether test accuracy matches or exceeds that of small-batch training.
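One way to make that experiment concrete is a sharpness-aware update that perturbs the weights toward higher loss before each descent step. The sketch below is not the paper's method (the paper predates such optimizers); it is a minimal stand-in for "added regularization or modified loss", assuming a PyTorch model and placeholder names:

```python
import torch


def sharpness_aware_step(model, loss_fn, inputs, targets, optimizer, rho=0.05):
    """One SAM-style large-batch step that biases training toward flat minima.

    1) Ascend: perturb the weights by rho along the normalized gradient.
    2) Descend: compute the gradient at the perturbed point and apply it
       to the original (restored) weights.
    """
    # First pass: gradient at the current weights.
    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()

    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

    # Ascent step toward the (approximately) sharpest nearby point.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.add_(rho * g / (grad_norm + 1e-12))

    # Second pass: gradient at the perturbed weights.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then update with the perturbed-point gradient.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.sub_(rho * g / (grad_norm + 1e-12))
    optimizer.step()
    return loss.item()
```

Comparing the test accuracy of a large-batch run trained this way against a standard small-batch baseline, at matched update counts, is exactly the comparison the claim above calls for.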
read the original abstract
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large-batch SGD for deep learning converges to sharp minimizers of the training and test loss, which generalize poorly, while small-batch SGD consistently reaches flatter minima due to gradient noise; this is supported by experiments across tasks and the authors discuss mitigation strategies for the large-batch generalization gap.
Significance. If the correlation between batch size, sharpness, and generalization is robust, the work offers a useful empirical lens on a key practical issue in scaling deep learning optimization. The consistent patterns reported across tasks are a strength, though the manuscript would benefit from tighter isolation of mechanisms.
major comments (2)
- Experiments section: the large- versus small-batch comparisons do not match total parameter updates or control for learning-rate scaling effects that necessarily change with batch size; these factors could independently influence the geometry of the reached minima.
- Sharpness and generalization discussion: the central claim that sharpness is the primary operative cause of the observed generalization gap (rather than a correlated byproduct of altered noise, update count, or basin selection) lacks a controlled intervention that varies only curvature or noise while holding other dynamics fixed.
minor comments (2)
- Specify the precise definition and numerical procedure for the sharpness metric (top Hessian eigenvalue and directional curvature), and report the exact data splits and hyper-parameter schedules used; a sketch of one common numerical procedure follows this list.
- Clarify whether learning-rate scaling was linear or square-root with batch size and include this detail in all experimental tables.
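For the first minor comment, one common numerical procedure for the top Hessian eigenvalue is power iteration on Hessian-vector products. The sketch below assumes a PyTorch model and loss function (names are placeholders); it illustrates the requested kind of computation, not necessarily the authors' exact metric:

```python
import torch


def top_hessian_eigenvalue(model, loss_fn, inputs, targets, iters=20):
    """Estimate the largest Hessian eigenvalue of the loss at the current
    weights via power iteration on Hessian-vector products, a common
    numerical proxy for sharpness."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit start vector, one block per parameter tensor.
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / v_norm for x in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the weights.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient with the current unit vector v.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        hv_norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (hv_norm + 1e-12) for h in hv]
    return eig
```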
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address the major comments point by point below, indicating where revisions have been made.
read point-by-point responses
- Referee: Experiments section: the large- versus small-batch comparisons do not match total parameter updates or control for learning-rate scaling effects that necessarily change with batch size; these factors could independently influence the geometry of the reached minima.
Authors: We agree that training for a fixed number of epochs results in fewer total parameter updates for larger batches, and that learning-rate scaling is required when changing batch size. Our experiments followed standard practice (fixed epochs, linear LR scaling with batch size) to reflect realistic large-batch usage. In the revised manuscript we have added a controlled comparison that matches the total number of gradient updates across batch sizes by reducing the epoch count for the small-batch case; the sharpness and generalization differences persist under this protocol. We have also expanded the discussion to explicitly note the potential influence of update count and LR scaling as confounding factors. A worked sketch of the update-matched protocol follows this item.
Revision: yes.
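The update-matching and learning-rate-scaling bookkeeping described in the response is simple enough to spell out. The numbers below are illustrative placeholders, not the paper's actual schedules:

```python
# Bookkeeping for an update-matched large- vs small-batch comparison
# (illustrative values only; the paper's actual schedules may differ).
dataset_size = 50_000
small_bs, large_bs = 256, 5_000          # e.g. ~10% of the data as the large batch
base_lr = 0.1                             # tuned for the small-batch run

# Linear learning-rate scaling with batch size.
large_lr = base_lr * large_bs / small_bs

# Fixed-epoch protocol: the large-batch run takes far fewer updates.
epochs = 100
small_updates = epochs * (dataset_size // small_bs)
large_updates = epochs * (dataset_size // large_bs)

# Update-matched protocol: shrink the small-batch epoch budget so both
# runs take the same number of gradient steps.
matched_small_epochs = large_updates // (dataset_size // small_bs)

print(f"LR (large batch): {large_lr:.3f}")
print(f"updates small/large (fixed epochs): {small_updates} / {large_updates}")
print(f"small-batch epochs for matched updates: {matched_small_epochs}")
```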
- Referee: Sharpness and generalization discussion: the central claim that sharpness is the primary operative cause of the observed generalization gap (rather than a correlated byproduct of altered noise, update count, or basin selection) lacks a controlled intervention that varies only curvature or noise while holding other dynamics fixed.
Authors: We acknowledge that the evidence presented is primarily correlational: large-batch training consistently reaches sharper minima with poorer generalization, while small-batch training reaches flatter minima. We do not claim to have performed a controlled intervention that varies only curvature while holding noise, update count, and basin selection fixed; such an experiment is technically challenging in high-dimensional non-convex landscapes. The revised discussion section now frames the results more cautiously, emphasizing the observed correlation and the role of gradient noise in helping small-batch methods avoid sharp regions, while listing alternative explanations without asserting that sharpness is the sole causal mechanism.
Revision: partial.
- A fully controlled intervention that isolates sharpness (or noise) as the sole causal factor while holding update count, learning-rate scaling, and basin selection fixed is not feasible with current methods and remains outside the scope of the present work.
Circularity Check
No circularity: empirical observations from controlled training runs
full rationale
The paper is an empirical study that trains identical architectures under small- and large-batch regimes, records the resulting minima, and measures sharpness directly via the top Hessian eigenvalue and directional curvature. The central claim (large-batch solutions are sharper and generalize worse) is obtained by these measurements rather than by any algebraic derivation, parameter fit renamed as prediction, or self-referential definition. The passing reference to the known link between sharpness and generalization is attributed to prior external literature and is not load-bearing for the paper's own experimental result. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from the authors' prior work appears in the derivation; the observations remain falsifiable against independent runs on the same benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sharp minima of the loss surface generalize worse than flat minima.
Lean theorems connected to this paper
- Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 27 Pith papers
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
  Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
- Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
  LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
- Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
  Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...
- Characterizing and Correcting Effective Target Shift in Online Learning
  Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
- ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees
  ConquerNet smooths quantile ReLU networks with convolution for easier training and establishes minimax-optimal nonasymptotic risk bounds over Besov function classes.
- Estimating Implicit Regularization in Deep Learning
  Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
  EPGS detects stubborn hallucinations by perturbing embeddings with noise and tracking gradient magnitude spikes, outperforming entropy and representation baselines as a proxy for loss landscape sharpness.
- When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
  FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
- Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces
  SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
- Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
  Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
- Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
  Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
- Generalization at the Edge of Stability
  Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...
- Lorentz Framework for Semantic Segmentation
  A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3...
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- Improving Generalization by Permutation Routing Across Model Copies
  Replicating models and routing their local losses via permutations from a mixing kernel Q enables structured message sharing that improves generalization.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
  EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
- Sampling Parallelism for Fast and Efficient Bayesian Learning
  Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
- Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
  MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.
- Spectral methods: crucial for machine learning, natural for quantum computers?
  Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.
- Intelligence Inertia: Physical Isomorphism and Applications
  Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
- Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
  The paper proposes Trajectory Regularized Merging (TRM) to enable storage-free model merging in continual learning by optimizing in an augmented trajectory subspace with task alignment, prediction consistency, and gra...
- A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots
  OpenCLIP-based gesture classification with linear probing controls AcoustoBot swarms at 87.8% accuracy and 3.95 s latency in controlled tests.
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...
Reference graph
Works this paper leans on
- [1] Optimization methods for large-scale machine learning
  Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
- [2] Entropy-SGD: Biasing gradient descent into wide valleys
  Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
- [3] Distributed deep learning using synchronous stochastic gradient descent
  Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.
- [4] Explaining and Harnessing Adversarial Examples
  Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014a. Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014b. Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton...
- [5] Train faster, generalize better: Stability of stochastic gradient descent
  M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
- [6] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [7] Adam: A method for stochastic optimization
  D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015), 2015.
- [8] Playing Atari with Deep Reinforcement Learning
  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [9] Training recurrent neural networks by diffusion
  Hossein Mobahi. Training recurrent neural networks by diffusion. arXiv preprint arXiv:1601.04114, 2016.
- [10] The Kaldi speech recognition toolkit
  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, number EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
- [11] Understanding adversarial training: Increasing local stability of neural nets through robust optimization
  Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
- [12] Very Deep Convolutional Networks for Large-Scale Image Recognition
  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [13] No bad local minima: Data independent training error guarantees for multilayer neural networks
  Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- [14] Dropout: A simple way to prevent neural networks from overfitting
  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [15] On the importance of initialization and momentum in deep learning
  I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pp. 1139–1147, 2013.
- [16] Deep learning with elastic averaging SGD
  Sixin Zhang, Anna E. Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pp. 685–693, 2015.
- [17] Improving the robustness of deep neural networks via stability training
  Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. arXiv preprint arXiv:1604.04326, 2016.
discussion (0)