pith. machine review for the scientific record.

arxiv: 2010.01412 · v3 · submitted 2020-10-03 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Sharpness-Aware Minimization for Efficiently Improving Generalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:10 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords sharpness-aware minimization · generalization · loss landscape · min-max optimization · deep learning · label noise · image classification

The pith

Sharpness-Aware Minimization finds parameters in flat loss neighborhoods to improve generalization over standard training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Overparameterized models can achieve low training loss yet still generalize poorly. The paper introduces Sharpness-Aware Minimization (SAM), a procedure that simultaneously minimizes loss value and loss sharpness by seeking parameters whose neighborhoods have uniformly low loss. This is expressed as a min-max optimization problem that gradient descent can solve efficiently. Experiments demonstrate that SAM improves generalization on CIFAR-10, CIFAR-100, ImageNet and finetuning tasks, reaching new state-of-the-art results on several benchmarks while also matching specialized methods in robustness to label noise.

Core claim

SAM seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently, leading to improved generalization across benchmark datasets and models.

What carries the argument

The min-max objective that minimizes the maximum loss value inside a neighborhood of fixed radius around the current parameters, approximated via a first-order Taylor expansion for efficient gradient computation.
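One compact way to read that machinery, written here for the Euclidean-norm neighborhood the paper treats as the default (the general p-norm variant and the usual weight-decay term are omitted):

```latex
% SAM objective: minimize the worst-case training loss within a rho-ball around w
\min_{w} \;\; \max_{\|\epsilon\|_2 \le \rho} L_S(w + \epsilon)

% A first-order Taylor expansion of the inner maximization gives a closed-form
% perturbation, and the SAM gradient is the plain gradient evaluated at that point:
\hat{\epsilon}(w) \approx \rho \, \frac{\nabla_w L_S(w)}{\lVert \nabla_w L_S(w) \rVert_2},
\qquad
\nabla_w L^{\mathrm{SAM}}_S(w) \approx \nabla_w L_S(w) \big|_{w + \hat{\epsilon}(w)}
```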

Load-bearing premise

That parameters whose neighborhoods have uniformly low loss will reliably generalize better than parameters found by minimizing training loss alone.

What would settle it

An experiment on a standard benchmark where SAM training produces worse test accuracy than standard gradient descent while using identical model size, data, and hyperparameter budgets.

read the original abstract

In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at \url{https://github.com/google-research/sam}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sharpness-Aware Minimization (SAM), a min-max optimization procedure that seeks parameters lying in neighborhoods of uniformly low loss to simultaneously minimize training loss value and loss sharpness. The inner maximization over perturbations of size at most rho is approximated via a single gradient-ascent step, after which the outer minimization is performed with gradient descent. Empirical results on CIFAR-10, CIFAR-100, ImageNet, and finetuning tasks show consistent generalization improvements and new state-of-the-art performance for several models, plus robustness to label noise comparable to specialized methods.
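As a reading aid for that two-pass update, a minimal sketch assuming a PyTorch-style model, loss, and base optimizer; the function name, defaults, and structure are illustrative rather than the authors' released implementation:

```python
import torch

def sam_step(model, loss_fn, inputs, targets, optimizer, rho=0.05):
    """One SAM update: a single gradient-ascent step to the rho-ball boundary,
    then a base-optimizer step using the gradient at the perturbed point."""
    # First pass: gradient at the current parameters w (grads assumed zeroed on entry).
    loss_fn(model(inputs), targets).backward()

    # Move each parameter to w + eps_hat, with eps_hat = rho * g / ||g||_2,
    # the norm taken over all parameters jointly.
    eps_list = []
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads])) + 1e-12
        for p in model.parameters():
            eps = rho * p.grad / grad_norm if p.grad is not None else None
            if eps is not None:
                p.add_(eps)
            eps_list.append(eps)

    # Second pass: gradient at the perturbed point w + eps_hat.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Restore the original parameters, then step with the sharpness-aware gradient.
    with torch.no_grad():
        for p, eps in zip(model.parameters(), eps_list):
            if eps is not None:
                p.sub_(eps)
    optimizer.step()
    optimizer.zero_grad()
```

The two forward-backward passes are also why SAM roughly doubles the per-step cost relative to the base optimizer.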

Significance. If the reported gains are reproducible under standard controls, SAM offers a practical, geometry-motivated regularizer that improves generalization in overparameterized models without requiring architectural changes. The open-sourced implementation is a clear strength that enables direct verification and extension.

major comments (2)
  1. [§3] §3 (SAM formulation and Algorithm 1): the single gradient-ascent step used to approximate the inner maximization over the rho-ball is presented without error bounds or analysis of how closely it tracks the true worst-case loss in non-convex high-dimensional landscapes; because the central claim equates neighborhood flatness with generalization, this approximation error is load-bearing and requires either a supporting lemma or explicit empirical validation that the surrogate correlates with true sharpness.
  2. [§4] §4 (experimental protocol): the reported improvements lack details on the number of independent runs, standard deviations, or statistical significance tests; without these, it is impossible to assess whether the gains over baselines are robust or could be explained by the implicit regularization induced by the perturbation step itself rather than the intended geometric principle.
minor comments (2)
  1. [§3] Notation for the perturbation step size and projection is introduced without an explicit equation reference in the main text; adding a numbered display equation would improve clarity.
  2. [§4] The abstract states 'novel state-of-the-art performance for several' models; the main text should list the exact prior SOTA numbers and the precise margins achieved for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (SAM formulation and Algorithm 1): the single gradient-ascent step used to approximate the inner maximization over the rho-ball is presented without error bounds or analysis of how closely it tracks the true worst-case loss in non-convex high-dimensional landscapes; because the central claim equates neighborhood flatness with generalization, this approximation error is load-bearing and requires either a supporting lemma or explicit empirical validation that the surrogate correlates with true sharpness.

    Authors: We acknowledge that the single-step gradient ascent approximation to the inner maximization lacks formal error bounds, which would be difficult to derive in the non-convex high-dimensional setting. This choice prioritizes computational efficiency, consistent with one-step approximations commonly used in related min-max problems such as adversarial training. To address the concern, we will add explicit empirical validation in the revised manuscript: we will compare the single-step surrogate sharpness to multi-step (e.g., 5-10 step) approximations on smaller models and subsets of CIFAR-10, showing strong correlation between the surrogate and true worst-case loss within the rho-ball, as well as alignment with observed generalization gains. revision: partial

  2. Referee: [§4] §4 (experimental protocol): the reported improvements lack details on the number of independent runs, standard deviations, or statistical significance tests; without these, it is impossible to assess whether the gains over baselines are robust or could be explained by the implicit regularization induced by the perturbation step itself rather than the intended geometric principle.

    Authors: We agree that additional statistical details are necessary for assessing robustness. In the revised version, we will report all main results as averages over at least 3 independent random seeds, including standard deviations. We will also add paired statistical significance tests (e.g., t-tests) against baselines to confirm the improvements. On the question of implicit regularization from the perturbation step, our existing ablations (random perturbation baselines and varying rho) already help isolate the geometric effect; we will expand this discussion and add further controls in the revision to more directly address this alternative explanation. revision: yes
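A sketch of the multi-step check promised in response 1 above, assuming the same PyTorch-style setup as the earlier sketch; the projected-ascent routine and its defaults are illustrative, not code from the paper or its repository:

```python
import torch

def worst_case_loss(model, loss_fn, inputs, targets, rho=0.05, steps=10):
    """Estimate max_{||eps||_2 <= rho} L_S(w + eps) by projected gradient ascent
    on the perturbation, for comparison against the one-step SAM surrogate."""
    params = [p for p in model.parameters() if p.requires_grad]
    eps = [torch.zeros_like(p) for p in params]
    step_size = rho / steps

    for step in range(steps + 1):
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.add_(e)                               # move to w + eps
        loss = loss_fn(model(inputs), targets)
        if step < steps:
            grads = torch.autograd.grad(loss, params)   # ascent direction at w + eps
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.sub_(e)                               # restore w
            if step == steps:
                break                                   # final pass only evaluates the loss
            # Normalized ascent step on eps, then projection back onto the rho-ball.
            g_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
            for e, g in zip(eps, grads):
                e.add_(step_size * g / g_norm)
            e_norm = torch.norm(torch.stack([e.norm() for e in eps]))
            if e_norm > rho:
                for e in eps:
                    e.mul_(rho / e_norm)
    return loss.detach()
```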
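And a minimal sketch of the seed-level statistics and paired test proposed in response 2; the accuracy values below are placeholders, not results from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed test accuracies (placeholders, not the paper's numbers).
baseline = np.array([96.1, 96.3, 96.0, 96.2, 96.1])
sam      = np.array([96.7, 96.8, 96.6, 96.9, 96.7])

print(f"baseline: {baseline.mean():.2f} +/- {baseline.std(ddof=1):.2f}")
print(f"SAM:      {sam.mean():.2f} +/- {sam.std(ddof=1):.2f}")

# Paired t-test across matched seeds, as the rebuttal proposes.
t_stat, p_value = stats.ttest_rel(sam, baseline)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```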

Circularity Check

0 steps flagged

The SAM min-max objective directly encodes the neighborhood-loss goal by definition; no reduction to fitted inputs or self-citation chains

full rationale

The paper's core formulation defines SAM explicitly as the procedure that minimizes the worst-case loss inside a rho-neighborhood, yielding the min-max problem without any intermediate derivation that collapses back to a fitted parameter or prior result by construction. Empirical gains on CIFAR/ImageNet are reported as separate validation rather than forced by the equations themselves. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain for the central claim. The single-step inner-max approximation is an efficiency choice, not a circular step. This yields a low but non-zero score reflecting the definitional nature of the objective while preserving independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that loss sharpness predicts generalization; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Parameters lying in neighborhoods of uniformly low loss generalize better than those at sharp minima
    Explicitly motivated by prior work connecting loss geometry and generalization; invoked to justify the min-max objective.

pith-pipeline@v0.9.0 · 5500 in / 1113 out tokens · 20832 ms · 2026-05-16T20:10:54.430679+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Estimating Implicit Regularization in Deep Learning

    stat.ML 2026-05 unverdicted novelty 7.0

    Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.

  2. iGENE: A Differentiable Flux-Tube Gyrokinetic Code in TensorFlow

    physics.plasm-ph 2026-05 unverdicted novelty 7.0

    A fully differentiable TensorFlow gyrokinetic code allows approximate gradients of nonlinear turbulence quantities to be used for outer-loop tasks such as profile prediction despite stochasticity.

  3. When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

    cs.LG 2026-04 conditional novelty 7.0

    FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

  4. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  5. TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection

    cs.LG 2026-05 unverdicted novelty 6.0

    TopoGeoScore combines a torsion-inspired Laplacian log-determinant, Ollivier-Ricci curvature, and higher-order topological summaries from source embeddings, with weights learned via self-supervised invariance to geome...

  6. Synthetic Data Generation for Long-Tail Medical Image Classification: A Case Study in Skin Lesions

    cs.CV 2026-05 unverdicted novelty 6.0

    A diffusion-based synthetic data pipeline using inpainting and OOD post-selection improves long-tail skin lesion classification on ISIC2019, delivering over 28% accuracy gain on the rarest class.

  7. Geometric and Spectral Alignment for Deep Neural Network II

    cs.LG 2026-05 unverdicted novelty 6.0

    The work establishes margin-verified certificates for physical alignment of residual Jacobian chains by bounding truncation errors and decomposing the Physical Alignment Matrix orthogonally under fitted effective-rank...

  8. Generalization at the Edge of Stability

    cs.LG 2026-04 unverdicted novelty 6.0

    Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...

  9. Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

    cs.LG 2026-04 unverdicted novelty 6.0

    Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.

  10. LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...

  11. Robust Policy Optimization to Prevent Catastrophic Forgetting

    cs.LG 2026-02 unverdicted novelty 6.0

    FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

  12. MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization

    cs.LG 2026-05 unverdicted novelty 5.0

    MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC be...

  13. Secure and Privacy-Preserving Vertical Federated Learning

    cs.CR 2026-04 unverdicted novelty 5.0

    Three optimized MPC protocols for privacy-preserving vertical federated learning that support global and global-local updates while reducing computation versus naive full-MPC delegation.

  14. A Faster Path to Continual Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    C-Flat Turbo accelerates continual learning by skipping redundant flatness gradients via direction-invariance observations and linear adaptive scheduling, delivering 1-1.25x speedup with comparable accuracy.

  15. Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks

    cs.LG 2026-04 unverdicted novelty 5.0

    A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.

  16. MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications

    cs.CV 2026-04 unverdicted novelty 5.0

    MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselin...

  17. FedNSAM: Consistency of Local and Global Flatness for Federated Learning

    cs.LG 2026-02 unverdicted novelty 4.0

    FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.
