pith. machine review for the scientific record.

arxiv: 2010.01412 · v3 · submitted 2020-10-03 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Sharpness-Aware Minimization for Efficiently Improving Generalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:10 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords sharpness-aware minimization · generalization · loss landscape · min-max optimization · deep learning · label noise · image classification

The pith

Sharpness-Aware Minimization finds parameters in flat loss neighborhoods to improve generalization over standard training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Overparameterized models can achieve low training loss yet still generalize poorly. The paper introduces Sharpness-Aware Minimization (SAM), a procedure that simultaneously minimizes loss value and loss sharpness by seeking parameters whose neighborhoods have uniformly low loss. This is expressed as a min-max optimization problem that gradient descent can solve efficiently. Experiments demonstrate that SAM improves generalization on CIFAR-10, CIFAR-100, ImageNet and finetuning tasks, reaching new state-of-the-art results on several benchmarks while also matching specialized methods in robustness to label noise.

Core claim

SAM seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently, leading to improved generalization across benchmark datasets and models.

What carries the argument

The min-max objective that minimizes the maximum loss value inside a neighborhood of fixed radius around the current parameters, approximated via a first-order Taylor expansion for efficient gradient computation.
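One compact way to read that machinery, written here for the Euclidean-norm neighborhood the paper treats as the default (the general p-norm variant and the usual weight-decay term are omitted):

```latex
% SAM objective: minimize the worst-case training loss within a rho-ball around w
\min_{w} \;\; \max_{\|\epsilon\|_2 \le \rho} L_S(w + \epsilon)

% A first-order Taylor expansion of the inner maximization gives a closed-form
% perturbation, and the SAM gradient is the plain gradient evaluated at that point:
\hat{\epsilon}(w) \approx \rho \, \frac{\nabla_w L_S(w)}{\lVert \nabla_w L_S(w) \rVert_2},
\qquad
\nabla_w L^{\mathrm{SAM}}_S(w) \approx \nabla_w L_S(w) \big|_{w + \hat{\epsilon}(w)}
```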

Load-bearing premise

That parameters whose neighborhoods have uniformly low loss will reliably generalize better than parameters found by minimizing training loss alone.

What would settle it

An experiment on a standard benchmark where SAM training produces worse test accuracy than standard gradient descent while using identical model size, data, and hyperparameter budgets.

read the original abstract

In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at \url{https://github.com/google-research/sam}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sharpness-Aware Minimization (SAM), a min-max optimization procedure that seeks parameters lying in neighborhoods of uniformly low loss to simultaneously minimize training loss value and loss sharpness. The inner maximization over perturbations of size at most rho is approximated via a single gradient-ascent step, after which the outer minimization is performed with gradient descent. Empirical results on CIFAR-10, CIFAR-100, ImageNet, and finetuning tasks show consistent generalization improvements and new state-of-the-art performance for several models, plus robustness to label noise comparable to specialized methods.
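As a reading aid for that two-pass update, a minimal sketch assuming a PyTorch-style model, loss, and base optimizer; the function name, defaults, and structure are illustrative rather than the authors' released implementation:

```python
import torch

def sam_step(model, loss_fn, inputs, targets, optimizer, rho=0.05):
    """One SAM update: a single gradient-ascent step to the rho-ball boundary,
    then a base-optimizer step using the gradient at the perturbed point."""
    # First pass: gradient at the current parameters w (grads assumed zeroed on entry).
    loss_fn(model(inputs), targets).backward()

    # Move each parameter to w + eps_hat, with eps_hat = rho * g / ||g||_2,
    # the norm taken over all parameters jointly.
    eps_list = []
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads])) + 1e-12
        for p in model.parameters():
            eps = rho * p.grad / grad_norm if p.grad is not None else None
            if eps is not None:
                p.add_(eps)
            eps_list.append(eps)

    # Second pass: gradient at the perturbed point w + eps_hat.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Restore the original parameters, then step with the sharpness-aware gradient.
    with torch.no_grad():
        for p, eps in zip(model.parameters(), eps_list):
            if eps is not None:
                p.sub_(eps)
    optimizer.step()
    optimizer.zero_grad()
```

The two forward-backward passes are also why SAM roughly doubles the per-step cost relative to the base optimizer.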

Significance. If the reported gains are reproducible under standard controls, SAM offers a practical, geometry-motivated regularizer that improves generalization in overparameterized models without requiring architectural changes. The open-sourced implementation is a clear strength that enables direct verification and extension.

major comments (2)
  1. [§3] §3 (SAM formulation and Algorithm 1): the single gradient-ascent step used to approximate the inner maximization over the rho-ball is presented without error bounds or analysis of how closely it tracks the true worst-case loss in non-convex high-dimensional landscapes; because the central claim equates neighborhood flatness with generalization, this approximation error is load-bearing and requires either a supporting lemma or explicit empirical validation that the surrogate correlates with true sharpness.
  2. [§4] §4 (experimental protocol): the reported improvements lack details on the number of independent runs, standard deviations, or statistical significance tests; without these, it is impossible to assess whether the gains over baselines are robust or could be explained by the implicit regularization induced by the perturbation step itself rather than the intended geometric principle.
minor comments (2)
  1. [§3] Notation for the perturbation step size and projection is introduced without an explicit equation reference in the main text; adding a numbered display equation would improve clarity.
  2. [§4] The abstract states 'novel state-of-the-art performance for several' models; the main text should list the exact prior SOTA numbers and the precise margins achieved for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (SAM formulation and Algorithm 1): the single gradient-ascent step used to approximate the inner maximization over the rho-ball is presented without error bounds or analysis of how closely it tracks the true worst-case loss in non-convex high-dimensional landscapes; because the central claim equates neighborhood flatness with generalization, this approximation error is load-bearing and requires either a supporting lemma or explicit empirical validation that the surrogate correlates with true sharpness.

    Authors: We acknowledge that the single-step gradient ascent approximation to the inner maximization lacks formal error bounds, which would be difficult to derive in the non-convex high-dimensional setting. This choice prioritizes computational efficiency, consistent with one-step approximations commonly used in related min-max problems such as adversarial training. To address the concern, we will add explicit empirical validation in the revised manuscript: we will compare the single-step surrogate sharpness to multi-step (e.g., 5-10 step) approximations on smaller models and subsets of CIFAR-10, showing strong correlation between the surrogate and true worst-case loss within the rho-ball, as well as alignment with observed generalization gains. revision: partial

  2. Referee: [§4] §4 (experimental protocol): the reported improvements lack details on the number of independent runs, standard deviations, or statistical significance tests; without these, it is impossible to assess whether the gains over baselines are robust or could be explained by the implicit regularization induced by the perturbation step itself rather than the intended geometric principle.

    Authors: We agree that additional statistical details are necessary for assessing robustness. In the revised version, we will report all main results as averages over at least 3 independent random seeds, including standard deviations. We will also add paired statistical significance tests (e.g., t-tests) against baselines to confirm the improvements. On the question of implicit regularization from the perturbation step, our existing ablations (random perturbation baselines and varying rho) already help isolate the geometric effect; we will expand this discussion and add further controls in the revision to more directly address this alternative explanation. revision: yes
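A sketch of the multi-step check promised in response 1 above, assuming the same PyTorch-style setup as the earlier sketch; the projected-ascent routine and its defaults are illustrative, not code from the paper or its repository:

```python
import torch

def worst_case_loss(model, loss_fn, inputs, targets, rho=0.05, steps=10):
    """Estimate max_{||eps||_2 <= rho} L_S(w + eps) by projected gradient ascent
    on the perturbation, for comparison against the one-step SAM surrogate."""
    params = [p for p in model.parameters() if p.requires_grad]
    eps = [torch.zeros_like(p) for p in params]
    step_size = rho / steps

    for step in range(steps + 1):
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.add_(e)                               # move to w + eps
        loss = loss_fn(model(inputs), targets)
        if step < steps:
            grads = torch.autograd.grad(loss, params)   # ascent direction at w + eps
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.sub_(e)                               # restore w
            if step == steps:
                break                                   # final pass only evaluates the loss
            # Normalized ascent step on eps, then projection back onto the rho-ball.
            g_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
            for e, g in zip(eps, grads):
                e.add_(step_size * g / g_norm)
            e_norm = torch.norm(torch.stack([e.norm() for e in eps]))
            if e_norm > rho:
                for e in eps:
                    e.mul_(rho / e_norm)
    return loss.detach()
```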
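And a minimal sketch of the seed-level statistics and paired test proposed in response 2; the accuracy values below are placeholders, not results from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed test accuracies (placeholders, not the paper's numbers).
baseline = np.array([96.1, 96.3, 96.0, 96.2, 96.1])
sam      = np.array([96.7, 96.8, 96.6, 96.9, 96.7])

print(f"baseline: {baseline.mean():.2f} +/- {baseline.std(ddof=1):.2f}")
print(f"SAM:      {sam.mean():.2f} +/- {sam.std(ddof=1):.2f}")

# Paired t-test across matched seeds, as the rebuttal proposes.
t_stat, p_value = stats.ttest_rel(sam, baseline)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```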

Circularity Check

0 steps flagged

The SAM min-max objective directly encodes the neighborhood-loss goal by definition; no reduction to fitted inputs or self-citation chains

full rationale

The paper's core formulation defines SAM explicitly as the procedure that minimizes the worst-case loss inside a rho-neighborhood, yielding the min-max problem without any intermediate derivation that collapses back to a fitted parameter or prior result by construction. Empirical gains on CIFAR/ImageNet are reported as separate validation rather than forced by the equations themselves. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain for the central claim. The single-step inner-max approximation is an efficiency choice, not a circular step. This yields a low but non-zero score reflecting the definitional nature of the objective while preserving independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that loss sharpness predicts generalization; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Parameters lying in neighborhoods of uniformly low loss generalize better than those at sharp minima
    Explicitly motivated by prior work connecting loss geometry and generalization; invoked to justify the min-max objective.

pith-pipeline@v0.9.0 · 5500 in / 1113 out tokens · 20832 ms · 2026-05-16T20:10:54.430679+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Estimating Implicit Regularization in Deep Learning

    stat.ML 2026-05 unverdicted novelty 7.0

    Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.

  2. iGENE: A Differentiable Flux-Tube Gyrokinetic Code in TensorFlow

    physics.plasm-ph 2026-05 unverdicted novelty 7.0

    A fully differentiable TensorFlow gyrokinetic code allows approximate gradients of nonlinear turbulence quantities to be used for outer-loop tasks such as profile prediction despite stochasticity.

  3. When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

    cs.LG 2026-04 conditional novelty 7.0

    FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

  4. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  5. TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection

    cs.LG 2026-05 unverdicted novelty 6.0

    TopoGeoScore combines a torsion-inspired Laplacian log-determinant, Ollivier-Ricci curvature, and higher-order topological summaries from source embeddings, with weights learned via self-supervised invariance to geome...

  6. Synthetic Data Generation for Long-Tail Medical Image Classification: A Case Study in Skin Lesions

    cs.CV 2026-05 unverdicted novelty 6.0

    A diffusion-based synthetic data pipeline using inpainting and OOD post-selection improves long-tail skin lesion classification on ISIC2019, delivering over 28% accuracy gain on the rarest class.

  7. Geometric and Spectral Alignment for Deep Neural Network II

    cs.LG 2026-05 unverdicted novelty 6.0

    The work establishes margin-verified certificates for physical alignment of residual Jacobian chains by bounding truncation errors and decomposing the Physical Alignment Matrix orthogonally under fitted effective-rank...

  8. Generalization at the Edge of Stability

    cs.LG 2026-04 unverdicted novelty 6.0

    Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...

  9. Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

    cs.LG 2026-04 unverdicted novelty 6.0

    Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.

  10. LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...

  11. Robust Policy Optimization to Prevent Catastrophic Forgetting

    cs.LG 2026-02 unverdicted novelty 6.0

    FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

  12. MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization

    cs.LG 2026-05 unverdicted novelty 5.0

    MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC be...

  13. Secure and Privacy-Preserving Vertical Federated Learning

    cs.CR 2026-04 unverdicted novelty 5.0

    Three optimized MPC protocols for privacy-preserving vertical federated learning that support global and global-local updates while reducing computation versus naive full-MPC delegation.

  14. A Faster Path to Continual Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    C-Flat Turbo accelerates continual learning by skipping redundant flatness gradients via direction-invariance observations and linear adaptive scheduling, delivering 1-1.25x speedup with comparable accuracy.

  15. Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks

    cs.LG 2026-04 unverdicted novelty 5.0

    A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.

  16. MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications

    cs.CV 2026-04 unverdicted novelty 5.0

    MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselin...

  17. FedNSAM: Consistency of Local and Global Flatness for Federated Learning

    cs.LG 2026-02 unverdicted novelty 4.0

    FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.
