Recognition: 2 theorem links · Lean Theorem
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Pith reviewed 2026-05-13 10:54 UTC · model grok-4.3
The pith
Large-batch SGD converges to sharp minima that generalize worse than the flat minima reached by small-batch methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large-batch methods tend to converge to sharp minimizers of the training and testing functions, which lead to poorer generalization, whereas small-batch methods consistently converge to flat minimizers due to the inherent noise in the gradient estimation.
What carries the argument
Sharpness of the loss minima, measured by the curvature or width of the basin, which correlates directly with generalization performance.
If this is right
- Large-batch training can close the generalization gap if techniques are applied to favor flatter minima.
- The noise from small-batch gradients functions as an implicit regularizer that promotes flatter solutions.
- Hyperparameter adjustments such as learning-rate scaling can partially mitigate the sharpness issue in large-batch regimes.
- Parallel hardware speedups from large batches become usable in practice only after the sharpness-related gap is addressed.
Where Pith is reading between the lines
- If sharpness is causal, then any method that penalizes high curvature could be combined with large batches to improve generalization without reducing batch size.
- The result implies that optimal batch size may depend on the geometry of the particular loss surface rather than solely on hardware efficiency.
- Controlled experiments adding synthetic noise to large-batch gradients could test whether the flat-minima benefit can be reproduced directly (a minimal sketch of such an experiment follows this list).
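The last item is concrete enough to sketch. Below is a minimal illustration, assuming a PyTorch model, loss function, and optimizer (all names here are placeholders, and the noise-scaling rule is a heuristic rather than anything prescribed by the paper), of injecting synthetic Gaussian noise into large-batch gradients so that part of the variance lost by enlarging the batch is restored:

```python
import torch


def noisy_large_batch_step(model, loss_fn, inputs, targets, optimizer,
                           large_bs, small_bs, base_noise_std=1e-3):
    """One large-batch step with synthetic Gaussian gradient noise.

    Heuristic: SGD gradient noise variance scales roughly as 1/batch_size,
    so the added noise scale grows with the batch-size ratio. This is an
    illustrative protocol, not the paper's.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Restore some of the "missing" small-batch gradient variance.
    scale = base_noise_std * max(large_bs / small_bs - 1.0, 0.0) ** 0.5
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad) * scale)

    optimizer.step()
    return loss.item()
```

If the flat-minima benefit is really driven by gradient noise, sweeping base_noise_std in a run like this should recover part of the small-batch generalization behavior.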
Load-bearing premise
The observed difference in sharpness between large-batch and small-batch minima is the primary driver of the generalization gap rather than a side effect of other training factors.
What would settle it
Train a network with large batches while explicitly encouraging a flat minimum through added regularization or modified loss, then measure whether test accuracy matches or exceeds that of small-batch training.
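One way to make that experiment concrete is a sharpness-aware update that perturbs the weights toward higher loss before each descent step. The sketch below is not the paper's method (the paper predates such optimizers); it is a minimal stand-in for "added regularization or modified loss", assuming a PyTorch model and placeholder names:

```python
import torch


def sharpness_aware_step(model, loss_fn, inputs, targets, optimizer, rho=0.05):
    """One SAM-style large-batch step that biases training toward flat minima.

    1) Ascend: perturb the weights by rho along the normalized gradient.
    2) Descend: compute the gradient at the perturbed point and apply it
       to the original (restored) weights.
    """
    # First pass: gradient at the current weights.
    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()

    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

    # Ascent step toward the (approximately) sharpest nearby point.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.add_(rho * g / (grad_norm + 1e-12))

    # Second pass: gradient at the perturbed weights.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then update with the perturbed-point gradient.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.sub_(rho * g / (grad_norm + 1e-12))
    optimizer.step()
    return loss.item()
```

Comparing the test accuracy of a large-batch run trained this way against a standard small-batch baseline, at matched update counts, is exactly the comparison the claim above calls for.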
read the original abstract
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large-batch SGD for deep learning converges to sharp minimizers of the training and test loss, which generalize poorly, while small-batch SGD consistently reaches flatter minima due to gradient noise; this is supported by experiments across tasks and the authors discuss mitigation strategies for the large-batch generalization gap.
Significance. If the correlation between batch size, sharpness, and generalization is robust, the work offers a useful empirical lens on a key practical issue in scaling deep learning optimization. The consistent patterns reported across tasks are a strength, though the manuscript would benefit from tighter isolation of mechanisms.
major comments (2)
- Experiments section: the large- versus small-batch comparisons do not match total parameter updates or control for learning-rate scaling effects that necessarily change with batch size; these factors could independently influence the geometry of the reached minima.
- Sharpness and generalization discussion: the central claim that sharpness is the primary operative cause of the observed generalization gap (rather than a correlated byproduct of altered noise, update count, or basin selection) lacks a controlled intervention that varies only curvature or noise while holding other dynamics fixed.
minor comments (2)
- Specify the precise definition and numerical procedure for the sharpness metric (top Hessian eigenvalue and directional curvature), and report the exact data splits and hyper-parameter schedules used; a sketch of one common numerical procedure follows this list.
- Clarify whether learning-rate scaling was linear or square-root with batch size and include this detail in all experimental tables.
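For the first minor comment, one common numerical procedure for the top Hessian eigenvalue is power iteration on Hessian-vector products. The sketch below assumes a PyTorch model and loss function (names are placeholders); it illustrates the requested kind of computation, not necessarily the authors' exact metric:

```python
import torch


def top_hessian_eigenvalue(model, loss_fn, inputs, targets, iters=20):
    """Estimate the largest Hessian eigenvalue of the loss at the current
    weights via power iteration on Hessian-vector products, a common
    numerical proxy for sharpness."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit start vector, one block per parameter tensor.
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / v_norm for x in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the weights.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient with the current unit vector v.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        hv_norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (hv_norm + 1e-12) for h in hv]
    return eig
```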
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address the major comments point by point below, indicating where revisions have been made.
read point-by-point responses
- Referee: Experiments section: the large- versus small-batch comparisons do not match total parameter updates or control for learning-rate scaling effects that necessarily change with batch size; these factors could independently influence the geometry of the reached minima.
Authors: We agree that training for a fixed number of epochs results in fewer total parameter updates for larger batches, and that learning-rate scaling is required when changing batch size. Our experiments followed standard practice (fixed epochs, linear LR scaling with batch size) to reflect realistic large-batch usage. In the revised manuscript we have added a controlled comparison that matches the total number of gradient updates across batch sizes by reducing the epoch count for the small-batch case; the sharpness and generalization differences persist under this protocol. We have also expanded the discussion to explicitly note the potential influence of update count and LR scaling as confounding factors. A worked sketch of the update-matched protocol follows this item.
Revision: yes.
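The update-matching and learning-rate-scaling bookkeeping described in the response is simple enough to spell out. The numbers below are illustrative placeholders, not the paper's actual schedules:

```python
# Bookkeeping for an update-matched large- vs small-batch comparison
# (illustrative values only; the paper's actual schedules may differ).
dataset_size = 50_000
small_bs, large_bs = 256, 5_000          # e.g. ~10% of the data as the large batch
base_lr = 0.1                             # tuned for the small-batch run

# Linear learning-rate scaling with batch size.
large_lr = base_lr * large_bs / small_bs

# Fixed-epoch protocol: the large-batch run takes far fewer updates.
epochs = 100
small_updates = epochs * (dataset_size // small_bs)
large_updates = epochs * (dataset_size // large_bs)

# Update-matched protocol: shrink the small-batch epoch budget so both
# runs take the same number of gradient steps.
matched_small_epochs = large_updates // (dataset_size // small_bs)

print(f"LR (large batch): {large_lr:.3f}")
print(f"updates small/large (fixed epochs): {small_updates} / {large_updates}")
print(f"small-batch epochs for matched updates: {matched_small_epochs}")
```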
- Referee: Sharpness and generalization discussion: the central claim that sharpness is the primary operative cause of the observed generalization gap (rather than a correlated byproduct of altered noise, update count, or basin selection) lacks a controlled intervention that varies only curvature or noise while holding other dynamics fixed.
Authors: We acknowledge that the evidence presented is primarily correlational: large-batch training consistently reaches sharper minima with poorer generalization, while small-batch training reaches flatter minima. We do not claim to have performed a controlled intervention that varies only curvature while holding noise, update count, and basin selection fixed; such an experiment is technically challenging in high-dimensional non-convex landscapes. The revised discussion section now frames the results more cautiously, emphasizing the observed correlation and the role of gradient noise in helping small-batch methods avoid sharp regions, while listing alternative explanations without asserting that sharpness is the sole causal mechanism.
Revision: partial.
- A fully controlled intervention that isolates sharpness (or noise) as the sole causal factor while holding update count, learning-rate scaling, and basin selection fixed is not feasible with current methods and remains outside the scope of the present work.
Circularity Check
No circularity: empirical observations from controlled training runs
full rationale
The paper is an empirical study that trains identical architectures under small- and large-batch regimes, records the resulting minima, and measures sharpness directly via the top Hessian eigenvalue and directional curvature. The central claim (large-batch solutions are sharper and generalize worse) is obtained by these measurements rather than by any algebraic derivation, parameter fit renamed as prediction, or self-referential definition. The passing reference to the known link between sharpness and generalization is attributed to prior external literature and is not load-bearing for the paper's own experimental result. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from the authors' prior work appears in the derivation; the observations remain falsifiable against independent runs on the same benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sharp minima of the loss surface generalize worse than flat minima.
Lean theorems connected to this paper
- Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 27 Pith papers
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
  Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
- Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
  LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
- Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
  Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...
- Characterizing and Correcting Effective Target Shift in Online Learning
  Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
- ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees
  ConquerNet smooths quantile ReLU networks with convolution for easier training and establishes minimax-optimal nonasymptotic risk bounds over Besov function classes.
- Estimating Implicit Regularization in Deep Learning
  Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
  EPGS detects stubborn hallucinations by perturbing embeddings with noise and tracking gradient magnitude spikes, outperforming entropy and representation baselines as a proxy for loss landscape sharpness.
- When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
  FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
- Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces
  SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
- Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
  Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
- Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
  Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
- Generalization at the Edge of Stability
  Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...
- Lorentz Framework for Semantic Segmentation
  A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3...
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- Improving Generalization by Permutation Routing Across Model Copies
  Replicating models and routing their local losses via permutations from a mixing kernel Q enables structured message sharing that improves generalization.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
  EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
- Sampling Parallelism for Fast and Efficient Bayesian Learning
  Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
- Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
  MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.
- Spectral methods: crucial for machine learning, natural for quantum computers?
  Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.
- Intelligence Inertia: Physical Isomorphism and Applications
  Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
- Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
  The paper proposes Trajectory Regularized Merging (TRM) to enable storage-free model merging in continual learning by optimizing in an augmented trajectory subspace with task alignment, prediction consistency, and gra...
- A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots
  OpenCLIP-based gesture classification with linear probing controls AcoustoBot swarms at 87.8% accuracy and 3.95 s latency in controlled tests.
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...
Reference graph
Works this paper leans on
- [1] Optimization methods for large-scale machine learning
  Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
- [2] Entropy-SGD: Biasing gradient descent into wide valleys
  Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
- [3] Distributed deep learning using synchronous stochastic gradient descent
  Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.
- [4] Explaining and Harnessing Adversarial Examples
  Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014a. Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014b. Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton...
- [5] Train faster, generalize better: Stability of stochastic gradient descent
  M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
- [6] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [7] Adam: A method for stochastic optimization
  D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015), 2015.
- [8] Playing Atari with Deep Reinforcement Learning
  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [9] Training recurrent neural networks by diffusion
  Hossein Mobahi. Training recurrent neural networks by diffusion. arXiv preprint arXiv:1601.04114, 2016.
- [10] The Kaldi speech recognition toolkit
  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, number EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
- [11] Understanding adversarial training: Increasing local stability of neural nets through robust optimization
  Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
- [12] Very Deep Convolutional Networks for Large-Scale Image Recognition
  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [13] No bad local minima: Data independent training error guarantees for multilayer neural networks
  Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- [14] Dropout: A simple way to prevent neural networks from overfitting
  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [15] On the importance of initialization and momentum in deep learning
  I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pp. 1139–1147, 2013.
- [16] Deep learning with elastic averaging SGD
  Sixin Zhang, Anna E. Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pp. 685–693, 2015.
- [17] Improving the robustness of deep neural networks via stability training
  Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. arXiv preprint arXiv:1604.04326, 2016.
discussion (0)