Recognition: no theorem link
Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3
The pith
Fixing the allowed loss increase rather than the parameter radius in sharpness-aware minimization removes gradient dominance and improves generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Loss-Equated SAM (LE-SAM) inverts the traditional SAM mechanism by replacing the fixed perturbation radius in parameter space with a fixed loss-space budget. This change effectively removes gradient-norm-dominated learning signals and shifts optimization toward curvature-dominated terms, resulting in improved generalization performance.
What carries the argument
The loss-equated adversarial perturbation, which bounds the worst-case loss increase by a fixed value instead of bounding the Euclidean distance in parameter space.
If this is right
- LE-SAM consistently outperforms both SAM and its existing variants on diverse benchmarks and tasks.
- The optimizer places greater weight on curvature information during each update step.
- The resulting minima produce stronger generalization without any increase in training cost.
- The same inversion principle applies across multiple vision and language tasks where SAM is currently used.
Where Pith is reading between the lines
- Loss-bounded perturbations could be substituted into other minimax formulations used for robustness or domain adaptation.
- An adaptive version that slowly tightens the loss budget during training might combine the benefits of both radius and loss views.
- The same idea invites direct comparison against second-order methods that explicitly estimate Hessian curvature.
Load-bearing premise
That fixing the loss-space budget for the perturbation directly removes gradient-norm effects and thereby shifts focus to curvature.
What would settle it
Training LE-SAM on a standard image-classification benchmark and finding test accuracy no higher than that of SAM would falsify the central claim.
Figures
read the original abstract
Sharpness-Aware Minimization (SAM) improves generalization by minimizing the worst-case loss within a fixed parameter-space radius neighborhood. SAM and its variants mainly rely on a first-order linearized surrogate, while flat minima are inherently a second-order (curvature) notion.We revisit this mismatch and propose Loss-Equated SAM (LE-SAM), which inverts the traditional SAM mechanism that fixed perturbation radius with a fixed loss-space budget,effectively removing gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms. Extensive experiments across diverse benchmarks and tasks demonstrate the strong generalization ability of LESAM that consistently outperforms SAM and even its variants, achieving the state-of-the-art performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Loss-Equated SAM (LE-SAM) as an inversion of standard Sharpness-Aware Minimization (SAM): instead of minimizing the worst-case loss inside a fixed parameter-space radius, it fixes a loss-space budget for the adversarial perturbation. This change is asserted to eliminate gradient-norm-dominated signals and emphasize curvature-dominated terms. Extensive experiments across benchmarks and tasks are reported to show that LE-SAM consistently outperforms SAM and its variants, reaching state-of-the-art generalization.
Significance. If the claimed mechanistic shift from gradient-norm to curvature dominance is rigorously derived and the empirical gains prove robust to controls for hyper-parameter tuning and implementation details, the work could refine the design of sharpness-aware optimizers and improve generalization bounds in deep learning. The empirical breadth is a potential strength, but the absence of a supporting derivation in the abstract leaves the central rationale unverified.
major comments (2)
- [Abstract] Abstract: the central claim that fixing a loss-space budget 'effectively removes gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms' is presented as an immediate consequence of the inversion, yet no equation, first-order approximation, or update rule for the perturbation (e.g., arg min_ε ||ε|| s.t. L(θ+ε)−L(θ)=constant or its linearization) is supplied. This derivation is load-bearing for the mechanistic explanation and must be provided before the curvature-shift rationale can be evaluated.
- [Abstract] Abstract / Experiments: the assertion of 'state-of-the-art performance' and 'strong generalization ability' is stated without reference to specific tables, metrics, error bars, or ablation controls for the loss-budget hyper-parameter. Without these details the empirical claim cannot be assessed for statistical significance or confounding factors.
minor comments (2)
- [Abstract] Inconsistent acronym usage: 'LE-SAM' and 'LESAM' appear interchangeably; standardize to one form throughout.
- [Abstract] The phrase 'inverts the traditional SAM mechanism' is used without a concise contrast equation or pseudocode showing how the new perturbation differs from the standard SAM ascent step.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We provide point-by-point responses to the major comments and are prepared to revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that fixing a loss-space budget 'effectively removes gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms' is presented as an immediate consequence of the inversion, yet no equation, first-order approximation, or update rule for the perturbation (e.g., arg min_ε ||ε|| s.t. L(θ+ε)−L(θ)=constant or its linearization) is supplied. This derivation is load-bearing for the mechanistic explanation and must be provided before the curvature-shift rationale can be evaluated.
Authors: The full manuscript in Section 3 derives the perturbation under the fixed loss budget. Using a first-order Taylor expansion, L(θ + ε) ≈ L(θ) + ∇L · ε = L(θ) + δ, leading to the minimal ||ε|| perturbation ε = (δ / ||∇L||^2) ∇L. This explicitly shows the inverse dependence on the gradient norm, diminishing gradient-norm dominance and highlighting curvature effects in higher-order terms. We will add a concise version of this approximation to the abstract in the revision. revision: yes
-
Referee: [Abstract] Abstract / Experiments: the assertion of 'state-of-the-art performance' and 'strong generalization ability' is stated without reference to specific tables, metrics, error bars, or ablation controls for the loss-budget hyper-parameter. Without these details the empirical claim cannot be assessed for statistical significance or confounding factors.
Authors: While the abstract is a high-level summary, the full manuscript details the empirical results in Tables 1-6, reporting mean performance metrics with standard deviations across multiple runs on various benchmarks, along with ablations for the loss-budget hyperparameter in Section 4. We will revise the abstract to include references to key tables and figures to support the state-of-the-art claim. revision: yes
Circularity Check
No circularity; proposal framed as independent inversion without equations or self-referential reductions.
full rationale
The provided abstract and description introduce LE-SAM by inverting SAM's fixed-radius perturbation to a fixed loss-space budget, asserting that this removes gradient-norm signals and emphasizes curvature. No equations, update rules, or derivations are supplied that would allow inspection for self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The text contains no self-citations at all, and the central claim is presented as a direct mechanistic consequence rather than a re-expression of prior fitted quantities or ansatzes. Per the rules, absence of any quotable reduction to inputs by construction means the derivation (such as it is) is self-contained; this is the expected honest non-finding when no load-bearing circular steps exist.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =
work page 2000
-
[2]
Proceedings of the IEEE international conference on computer vision , pages=
Deeper, broader and artier domain generalization , author=. Proceedings of the IEEE international conference on computer vision , pages=
-
[3]
arXiv preprint arXiv:2402.15152 , year=
On the duality between sharpness-aware minimization and adversarial training , author=. arXiv preprint arXiv:2402.15152 , year=
-
[4]
Towards Deep Learning Models Resistant to Adversarial Attacks
Towards deep learning models resistant to adversarial attacks , author=. arXiv preprint arXiv:1706.06083 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=
Erm++: An improved baseline for domain generalization , author=. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=. 2025 , organization=
work page 2025
-
[6]
International Conference on Artificial Intelligence and Statistics , pages=
Low-pass filtering sgd for recovering flat optima in the deep learning optimization landscape , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=
work page 2022
-
[7]
T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980
work page 1980
-
[8]
M. J. Kearns , title =
-
[9]
Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983
work page 1983
-
[10]
R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000
work page 2000
-
[11]
Suppressed for Anonymity , author=
-
[12]
A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981
work page 1981
-
[13]
A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959
work page 1959
-
[14]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Cr-sam: Curvature regularized sharpness-aware minimization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[15]
International Conference on Learning Representations , year=
Sharpness-aware Minimization for Efficiently Improving Generalization , author=. International Conference on Learning Representations , year=
-
[16]
Proceedings of the 38th International Conference on Machine Learning , pages =
ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =
work page 2021
-
[17]
International Conference on Learning Representations , year=
Surrogate Gap Minimization Improves Sharpness-Aware Training , author=. International Conference on Learning Representations , year=
-
[18]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Li, Tao and Zhou, Pan and He, Zhengbao and Cheng, Xinwen and Huang, Xiaolin , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =
work page 2024
-
[19]
Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization , author=. 2025 , eprint=
work page 2025
-
[20]
International Conference on Learning Representations , year=
Efficient Sharpness-aware Minimization for Improved Training of Neural Networks , author=. International Conference on Learning Representations , year=
-
[21]
Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy , author=. 2024 , eprint=
work page 2024
-
[22]
The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
A Single-Step, Sharpness-Aware Minimization is All You Need to Achieve Efficient and Accurate Sparse Training , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
-
[23]
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian , title =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =
-
[24]
Wide Residual Networks, in: Proceedings of the British Machine Vision Conference (BMVC), pp
Wide Residual Networks , author=. Proceedings of the British Machine Vision Conference (BMVC) , publisher=. 2016 , month=. doi:10.5244/C.30.87 , isbn=
-
[25]
Deep Pyramidal Residual Networks , year=
Han, Dongyoon and Kim, Jiwhan and Kim, Junmo , booktitle=. Deep Pyramidal Residual Networks , year=
-
[26]
Learning multiple layers of features from tiny images , author=. 2009 , publisher=
work page 2009
-
[27]
ImageNet: A large-scale hierarchical image database , year=
Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Kai Li and Li Fei-Fei , booktitle=. ImageNet: A large-scale hierarchical image database , year=
- [28]
-
[29]
PyTorch: An Imperative Style, High-Performance Deep Learning Library , doi =
Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Köpf, Andreas and Yang, Edward and DeVito, Zach and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and C...
-
[30]
Visualizing the Loss Landscape of Neural Nets , author=. 2018 , eprint=
work page 2018
-
[31]
arXiv preprint arXiv:1905.00313 , year=
Revisiting the Polyak step size , author=. arXiv preprint arXiv:1905.00313 , year=
-
[32]
Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=
work page 2019
-
[33]
Advances in neural information processing systems , volume=
Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
-
[34]
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
On large-batch training for deep learning: Generalization gap and sharp minima , author=. arXiv preprint arXiv:1609.04836 , year=
work page internal anchor Pith review arXiv
-
[35]
Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=
-
[36]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[37]
International conference on learning representations , year=
Exploring balanced feature spaces for representation learning , author=. International conference on learning representations , year=
-
[38]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Targeted supervised contrastive learning for long-tailed recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[39]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Improving calibration for long-tailed recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[40]
International Conference on Machine Learning , pages=
Feature directions matter: Long-tailed learning via rotated balanced representation , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[41]
Advances in neural information processing systems , volume=
Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network? , author=. Advances in neural information processing systems , volume=
-
[42]
arXiv preprint arXiv:2505.01660 , year=
Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification , author=. arXiv preprint arXiv:2505.01660 , year=
-
[43]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Contrastive learning based hybrid networks for long-tailed image classification , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[44]
Hessian-based Analysis of Large Batch Training and Robustness to Adversaries , url =
Yao, Zhewei and Gholami, Amir and Lei, Qi and Keutzer, Kurt and Mahoney, Michael W , booktitle =. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries , url =
-
[45]
Avron, Haim and Toledo, Sivan , title =. 2011 , issue_date =. doi:10.1145/1944345.1944349 , journal =
-
[46]
Some large-scale matrix computation problems , journal =. 1996 , issn =. doi:https://doi.org/10.1016/0377-0427(96)00018-0 , url =
-
[47]
Yao, Zhewei and Gholami, Amir and Keutzer, Kurt and Mahoney, Michael W. , booktitle=. PyHessian: Neural Networks Through the Lens of the Hessian , year=
-
[48]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Class-balanced loss based on effective number of samples , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[49]
Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training , author=. 2025 , eprint=
work page 2025
-
[50]
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =
Jian, Zhongquan and Chen, Yanhao and Wang, Yancheng and Yao, Junfeng and Wang, Meihong and Wu, Qingqiang , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =
work page 2025
-
[51]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Global and local mixture consistency cumulative learning for long-tailed visual recognitions , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[52]
International conference on machine learning , pages=
Beyond synthetic noise: Deep learning on controlled noisy labels , author=. International conference on machine learning , pages=. 2020 , organization=
work page 2020
-
[53]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
Swin transformer: Hierarchical vision transformer using shifted windows , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
-
[54]
Advances in Neural Information Processing Systems , volume=
Adaptive sgd with polyak stepsize and line-search: Robust convergence and variance reduction , author=. Advances in Neural Information Processing Systems , volume=
-
[55]
arXiv preprint arXiv:2406.04142 , year=
Stochastic Polyak step-sizes and momentum: Convergence guarantees and practical performance , author=. arXiv preprint arXiv:2406.04142 , year=
-
[56]
International Conference on Learning Representations , year=
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=
- [57]
-
[58]
Advances in Neural Information Processing Systems , editor=
Sharpness-Aware Training for Free , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=
work page 2022
-
[59]
Fisher SAM: Information Geometry and Sharpness Aware Minimisation , author=. 2022 , eprint=
work page 2022
-
[60]
Golub, Gene H. and van Loan, Charles F. , biburl =. Matrix Computations , url =
-
[61]
Simplifying neural nets by discovering flat minima , year =
Hochreiter, Sepp and Schmidhuber, J\". Simplifying neural nets by discovering flat minima , year =
-
[62]
International Conference on Learning Representations , year=
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , author=. International Conference on Learning Representations , year=
-
[63]
Proceedings of the 34th International Conference on Machine Learning , pages =
Sharp Minima Can Generalize For Deep Nets , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =
work page 2017
-
[64]
Journal of Machine Learning Research , year =
John Duchi and Elad Hazan and Yoram Singer , title =. Journal of Machine Learning Research , year =
-
[65]
Proceedings of Thirty Third Conference on Learning Theory , pages =
Complexity Guarantees for Polyak Steps with Momentum , author =. Proceedings of Thirty Third Conference on Learning Theory , pages =. 2020 , editor =
work page 2020
-
[66]
Exact Convergence rate of the subgradient method by using Polyak step size , author=. 2024 , eprint=
work page 2024
-
[67]
Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence , author=. 2021 , eprint=
work page 2021
-
[68]
Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution , author=. 2024 , eprint=
work page 2024
-
[69]
A Universal Class of Sharpness-Aware Minimization Algorithms , author=. 2024 , eprint=
work page 2024
-
[70]
On the Duality Between Sharpness-Aware Minimization and Adversarial Training , author=. 2024 , eprint=
work page 2024
-
[71]
Sharpness-Aware Minimization Alone can Improve Adversarial Robustness , author=. 2023 , eprint=
work page 2023
-
[72]
Dong, Mingrong and Yang, Yixuan and Zeng, Kai and Wang, Qingwang and Shen, Tao , TITLE =. Remote Sensing , VOLUME =. 2024 , NUMBER =
work page 2024
-
[73]
Sharpness-Aware Minimization with Dynamic Reweighting , author=. 2022 , eprint=
work page 2022
-
[74]
Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term , author=. 2024 , eprint=
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.