arxiv: 2605.10183 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization

Jinping Wang , Qinhan Liu , Zhiwu Xie , Zhiqiang Gao

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords sharpness-aware minimizationadversarial perturbationloss spacegeneralizationflat minimacurvaturegradient norm

0 comments

The pith

Fixing the allowed loss increase rather than the parameter radius in sharpness-aware minimization removes gradient dominance and improves generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sharpness-Aware Minimization seeks flat minima by minimizing the worst-case loss inside a neighborhood of the current parameters. The standard approach fixes the size of that neighborhood as a radius in parameter space, which relies on a first-order approximation. The paper instead fixes the maximum loss increase allowed inside the neighborhood. This inversion removes the part of the signal that scales with gradient size and leaves terms that reflect loss curvature. A sympathetic reader would care because the change requires no extra computation yet produces models that perform better on held-out data across many tasks.

Core claim

Loss-Equated SAM (LE-SAM) inverts the traditional SAM mechanism by replacing the fixed perturbation radius in parameter space with a fixed loss-space budget. This change effectively removes gradient-norm-dominated learning signals and shifts optimization toward curvature-dominated terms, resulting in improved generalization performance.

What carries the argument

The loss-equated adversarial perturbation, which bounds the worst-case loss increase by a fixed value instead of bounding the Euclidean distance in parameter space.

If this is right

LE-SAM consistently outperforms both SAM and its existing variants on diverse benchmarks and tasks.
The optimizer places greater weight on curvature information during each update step.
The resulting minima produce stronger generalization without any increase in training cost.
The same inversion principle applies across multiple vision and language tasks where SAM is currently used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Loss-bounded perturbations could be substituted into other minimax formulations used for robustness or domain adaptation.
An adaptive version that slowly tightens the loss budget during training might combine the benefits of both radius and loss views.
The same idea invites direct comparison against second-order methods that explicitly estimate Hessian curvature.

Load-bearing premise

That fixing the loss-space budget for the perturbation directly removes gradient-norm effects and thereby shifts focus to curvature.

What would settle it

Training LE-SAM on a standard image-classification benchmark and finding test accuracy no higher than that of SAM would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.10183 by Jinping Wang, Qinhan Liu, Zhiqiang Gao, Zhiwu Xie.

**Figure 1.** Figure 1: Illustration of loss-equated adversarial perturbation. LE-SAM fixes a budget in loss space and back-solves the corresponding adversarial perturbation radius in parameter space. This mechanism removes gradient norm dominated effects and shifts the optimization signal toward curvature-dominated (second-order) information. The shaded region indicates the excess loss beyond the budget due to the curvature ter… view at source ↗

**Figure 2.** Figure 2: Mechanism illustration of SAM vs. LE-SAM. Standard SAM fixes a perturbation radius in the parameter space and directly maps it to a loss increase dominated by the gradient norm. In contrast, LE-SAM fixes a budget in the loss space and back-solves the perturbation radius in the parameter space. Mapping back to the loss space removes the first-order gradient-dominated term and yields a second-order, curvatur… view at source ↗

**Figure 3.** Figure 3: (a) shows the distribution of top eigenvalues and the trace of Hessian at epoch 100 and 200 on CIFAR-100 with SAM, and our LE-SAM. (b) track the LPF value for both SAM and LE-SAM across 200 epochs SAM on CIFAR datasets, and empirically demonstrate our loss-equated mechanism can lead to a better generalization performance in a transfer learning scenario. 5.3. Flatness Metrics Eq. 13 shows that our loss-equa… view at source ↗

**Figure 5.** Figure 5: Perturbation radius across training process and the perturbation radius decreases, when optimization approaches a minimizer and gradients shrink again, ρ rises at the late stage of training. Beyond Mainly Late-Stage Gains of SAM Recent work on SAM points out that SAM works mainly in the late stage of training (Zhou et al., 2025). This phenomenon occurs since the practical surrogate of SAM contains a domina… view at source ↗

**Figure 4.** Figure 4: Visualization of loss landscape for SGD, SAM, and our LE-SAM 0 25 50 75 100 125 150 175 200 Epoch 0.0 0.1 0.2 0.3 0.4 P e r t u r b a t i o n R a d i u s Batch-wise Min-Max Mean [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 6.** Figure 6: Test accuracy trajectories under whole run vs. latestage switch training. We compare enabling SAM and LE-SAM throughout training against activating them after epoch 160 (SGD then SAM, SGD then LE-SAM). We apply all the methods to CIFAR-100 with ResNet-18. Model SAM LE-SAM LE-SAM+ ResNet-18 80.17 81.51 82.01 WRN-28-10 83.42 84.91 85.06 PyramidNet-110 84.46 85.91 86.19 [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗

**Figure 7.** Figure 7: Sensitivity analysis of loss budget σ [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

read the original abstract

Sharpness-Aware Minimization (SAM) improves generalization by minimizing the worst-case loss within a fixed parameter-space radius neighborhood. SAM and its variants mainly rely on a first-order linearized surrogate, while flat minima are inherently a second-order (curvature) notion.We revisit this mismatch and propose Loss-Equated SAM (LE-SAM), which inverts the traditional SAM mechanism that fixed perturbation radius with a fixed loss-space budget,effectively removing gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms. Extensive experiments across diverse benchmarks and tasks demonstrate the strong generalization ability of LESAM that consistently outperforms SAM and even its variants, achieving the state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LE-SAM flips SAM's fixed-radius perturbation to a fixed loss budget and claims better generalization, but the rationale for the curvature shift lacks any derivation or update rule.

read the letter

The main takeaway is that this paper proposes Loss-Equated SAM by replacing the usual fixed-radius neighborhood with a fixed loss increase for the perturbation. It reports that this change produces stronger generalization than SAM and its variants on a range of benchmarks and tasks, reaching what they call state-of-the-art results in some cases. That inversion is the core new element relative to the SAM literature they cite. Prior work keeps the perturbation inside a parameter-space ball; here the budget is defined in loss space instead. The experiments are the part that could actually matter to people. They run across diverse tasks and claim consistent wins, which is the kind of evidence that would interest practitioners who already use SAM or its close relatives. If the controls are clean and the gains hold up under reimplementation, the empirical side has practical value. The weak part is the explanation for why the change should work. The abstract states that fixing the loss budget removes gradient-norm signals and shifts the optimization toward curvature terms, but it supplies no equation for the perturbation, no first-order approximation, and no derivation of the resulting update. The stress-test note correctly flags this gap: without showing how the ascent direction is found or confirming that it decouples from gradient magnitude, the claimed mechanistic advantage remains an assertion rather than a demonstrated consequence. If the full paper contains the missing math and ablations, that would strengthen the case; otherwise the results stand alone without a clear link to the stated rationale. This work is aimed at researchers already familiar with sharpness-aware methods and looking for optimizer tweaks. A reader focused on empirical improvements in deep-net training could get something useful from the numbers, provided the code and details are available for checking. The paper shows enough concrete claims and experimental scope to deserve a serious referee rather than a desk rejection, even though the theoretical justification needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Loss-Equated SAM (LE-SAM) as an inversion of standard Sharpness-Aware Minimization (SAM): instead of minimizing the worst-case loss inside a fixed parameter-space radius, it fixes a loss-space budget for the adversarial perturbation. This change is asserted to eliminate gradient-norm-dominated signals and emphasize curvature-dominated terms. Extensive experiments across benchmarks and tasks are reported to show that LE-SAM consistently outperforms SAM and its variants, reaching state-of-the-art generalization.

Significance. If the claimed mechanistic shift from gradient-norm to curvature dominance is rigorously derived and the empirical gains prove robust to controls for hyper-parameter tuning and implementation details, the work could refine the design of sharpness-aware optimizers and improve generalization bounds in deep learning. The empirical breadth is a potential strength, but the absence of a supporting derivation in the abstract leaves the central rationale unverified.

major comments (2)

[Abstract] Abstract: the central claim that fixing a loss-space budget 'effectively removes gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms' is presented as an immediate consequence of the inversion, yet no equation, first-order approximation, or update rule for the perturbation (e.g., arg min_ε ||ε|| s.t. L(θ+ε)−L(θ)=constant or its linearization) is supplied. This derivation is load-bearing for the mechanistic explanation and must be provided before the curvature-shift rationale can be evaluated.
[Abstract] Abstract / Experiments: the assertion of 'state-of-the-art performance' and 'strong generalization ability' is stated without reference to specific tables, metrics, error bars, or ablation controls for the loss-budget hyper-parameter. Without these details the empirical claim cannot be assessed for statistical significance or confounding factors.

minor comments (2)

[Abstract] Inconsistent acronym usage: 'LE-SAM' and 'LESAM' appear interchangeably; standardize to one form throughout.
[Abstract] The phrase 'inverts the traditional SAM mechanism' is used without a concise contrast equation or pseudocode showing how the new perturbation differs from the standard SAM ascent step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We provide point-by-point responses to the major comments and are prepared to revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that fixing a loss-space budget 'effectively removes gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms' is presented as an immediate consequence of the inversion, yet no equation, first-order approximation, or update rule for the perturbation (e.g., arg min_ε ||ε|| s.t. L(θ+ε)−L(θ)=constant or its linearization) is supplied. This derivation is load-bearing for the mechanistic explanation and must be provided before the curvature-shift rationale can be evaluated.

Authors: The full manuscript in Section 3 derives the perturbation under the fixed loss budget. Using a first-order Taylor expansion, L(θ + ε) ≈ L(θ) + ∇L · ε = L(θ) + δ, leading to the minimal ||ε|| perturbation ε = (δ / ||∇L||^2) ∇L. This explicitly shows the inverse dependence on the gradient norm, diminishing gradient-norm dominance and highlighting curvature effects in higher-order terms. We will add a concise version of this approximation to the abstract in the revision. revision: yes
Referee: [Abstract] Abstract / Experiments: the assertion of 'state-of-the-art performance' and 'strong generalization ability' is stated without reference to specific tables, metrics, error bars, or ablation controls for the loss-budget hyper-parameter. Without these details the empirical claim cannot be assessed for statistical significance or confounding factors.

Authors: While the abstract is a high-level summary, the full manuscript details the empirical results in Tables 1-6, reporting mean performance metrics with standard deviations across multiple runs on various benchmarks, along with ablations for the loss-budget hyperparameter in Section 4. We will revise the abstract to include references to key tables and figures to support the state-of-the-art claim. revision: yes

Circularity Check

0 steps flagged

No circularity; proposal framed as independent inversion without equations or self-referential reductions.

full rationale

The provided abstract and description introduce LE-SAM by inverting SAM's fixed-radius perturbation to a fixed loss-space budget, asserting that this removes gradient-norm signals and emphasizes curvature. No equations, update rules, or derivations are supplied that would allow inspection for self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The text contains no self-citations at all, and the central claim is presented as a direct mechanistic consequence rather than a re-expression of prior fitted quantities or ansatzes. Per the rules, absence of any quotable reduction to inputs by construction means the derivation (such as it is) is self-contained; this is the expected honest non-finding when no load-bearing circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the proposal rests on the unelaborated claim that loss-budget inversion removes gradient-norm signals.

pith-pipeline@v0.9.0 · 5418 in / 975 out tokens · 43963 ms · 2026-05-12T04:14:06.892847+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 2 internal anchors

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000
[2]

Proceedings of the IEEE international conference on computer vision , pages=

Deeper, broader and artier domain generalization , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[3]

arXiv preprint arXiv:2402.15152 , year=

On the duality between sharpness-aware minimization and adversarial training , author=. arXiv preprint arXiv:2402.15152 , year=

work page arXiv
[4]

Towards Deep Learning Models Resistant to Adversarial Attacks

Towards deep learning models resistant to adversarial attacks , author=. arXiv preprint arXiv:1706.06083 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=

Erm++: An improved baseline for domain generalization , author=. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=. 2025 , organization=

work page 2025
[6]

International Conference on Artificial Intelligence and Statistics , pages=

Low-pass filtering sgd for recovering flat optima in the deep learning optimization landscape , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

work page 2022
[7]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980
[8]

M. J. Kearns , title =

work page
[9]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983
[10]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000
[11]

Suppressed for Anonymity , author=

work page
[12]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981
[13]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959
[14]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Cr-sam: Curvature regularized sharpness-aware minimization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[15]

International Conference on Learning Representations , year=

Sharpness-aware Minimization for Efficiently Improving Generalization , author=. International Conference on Learning Representations , year=

work page
[16]

Proceedings of the 38th International Conference on Machine Learning , pages =

ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

work page 2021
[17]

International Conference on Learning Representations , year=

Surrogate Gap Minimization Improves Sharpness-Aware Training , author=. International Conference on Learning Representations , year=

work page
[18]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Li, Tao and Zhou, Pan and He, Zhengbao and Cheng, Xinwen and Huang, Xiaolin , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =

work page 2024
[19]

2025 , eprint=

Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization , author=. 2025 , eprint=

work page 2025
[20]

International Conference on Learning Representations , year=

Efficient Sharpness-aware Minimization for Improved Training of Neural Networks , author=. International Conference on Learning Representations , year=

work page
[21]

2024 , eprint=

Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy , author=. 2024 , eprint=

work page 2024
[22]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

A Single-Step, Sharpness-Aware Minimization is All You Need to Achieve Efficient and Accurate Sparse Training , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[23]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian , title =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =

work page
[24]

Wide Residual Networks, in: Proceedings of the British Machine Vision Conference (BMVC), pp

Wide Residual Networks , author=. Proceedings of the British Machine Vision Conference (BMVC) , publisher=. 2016 , month=. doi:10.5244/C.30.87 , isbn=

work page doi:10.5244/c.30.87 2016
[25]

Deep Pyramidal Residual Networks , year=

Han, Dongyoon and Kim, Jiwhan and Kim, Junmo , booktitle=. Deep Pyramidal Residual Networks , year=

work page
[26]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009
[27]

ImageNet: A large-scale hierarchical image database , year=

Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Kai Li and Li Fei-Fei , booktitle=. ImageNet: A large-scale hierarchical image database , year=

work page
[28]

1987 , publisher=

Introduction to optimization , author=. 1987 , publisher=

work page 1987
[29]

PyTorch: An Imperative Style, High-Performance Deep Learning Library , doi =

Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Köpf, Andreas and Yang, Edward and DeVito, Zach and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and C...

work page
[30]

2018 , eprint=

Visualizing the Loss Landscape of Neural Nets , author=. 2018 , eprint=

work page 2018
[31]

arXiv preprint arXiv:1905.00313 , year=

Revisiting the Polyak step size , author=. arXiv preprint arXiv:1905.00313 , year=

work page arXiv 1905
[32]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

work page 2019
[33]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[34]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

On large-batch training for deep learning: Generalization gap and sharp minima , author=. arXiv preprint arXiv:1609.04836 , year=

work page internal anchor Pith review arXiv
[35]

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=

work page Pith review arXiv
[36]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[37]

International conference on learning representations , year=

Exploring balanced feature spaces for representation learning , author=. International conference on learning representations , year=

work page
[38]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Targeted supervised contrastive learning for long-tailed recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[39]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Improving calibration for long-tailed recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[40]

International Conference on Machine Learning , pages=

Feature directions matter: Long-tailed learning via rotated balanced representation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[41]

Advances in neural information processing systems , volume=

Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network? , author=. Advances in neural information processing systems , volume=

work page
[42]

arXiv preprint arXiv:2505.01660 , year=

Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification , author=. arXiv preprint arXiv:2505.01660 , year=

work page arXiv
[43]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Contrastive learning based hybrid networks for long-tailed image classification , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[44]

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries , url =

Yao, Zhewei and Gholami, Amir and Lei, Qi and Keutzer, Kurt and Mahoney, Michael W , booktitle =. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries , url =

work page
[45]

Avron and S

Avron, Haim and Toledo, Sivan , title =. 2011 , issue_date =. doi:10.1145/1944345.1944349 , journal =

work page doi:10.1145/1944345.1944349 2011
[46]

1996 , issn =

Some large-scale matrix computation problems , journal =. 1996 , issn =. doi:https://doi.org/10.1016/0377-0427(96)00018-0 , url =

work page doi:10.1016/0377-0427(96)00018-0 1996
[47]

, booktitle=

Yao, Zhewei and Gholami, Amir and Keutzer, Kurt and Mahoney, Michael W. , booktitle=. PyHessian: Neural Networks Through the Lens of the Hessian , year=

work page
[48]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Class-balanced loss based on effective number of samples , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[49]

2025 , eprint=

Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training , author=. 2025 , eprint=

work page 2025
[50]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Jian, Zhongquan and Chen, Yanhao and Wang, Yancheng and Yao, Junfeng and Wang, Meihong and Wu, Qingqiang , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

work page 2025
[51]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Global and local mixture consistency cumulative learning for long-tailed visual recognitions , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[52]

International conference on machine learning , pages=

Beyond synthetic noise: Deep learning on controlled noisy labels , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[53]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Swin transformer: Hierarchical vision transformer using shifted windows , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[54]

Advances in Neural Information Processing Systems , volume=

Adaptive sgd with polyak stepsize and line-search: Robust convergence and variance reduction , author=. Advances in Neural Information Processing Systems , volume=

work page
[55]

arXiv preprint arXiv:2406.04142 , year=

Stochastic Polyak step-sizes and momentum: Convergence guarantees and practical performance , author=. arXiv preprint arXiv:2406.04142 , year=

work page arXiv
[56]

International Conference on Learning Representations , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

work page
[57]

Momentum-

Marlon Becker and Frederick Altrock and Benjamin Risse , booktitle=. Momentum-. 2025 , url=

work page 2025
[58]

Advances in Neural Information Processing Systems , editor=

Sharpness-Aware Training for Free , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

work page 2022
[59]

2022 , eprint=

Fisher SAM: Information Geometry and Sharpness Aware Minimisation , author=. 2022 , eprint=

work page 2022
[60]

and van Loan, Charles F

Golub, Gene H. and van Loan, Charles F. , biburl =. Matrix Computations , url =

work page
[61]

Simplifying neural nets by discovering flat minima , year =

Hochreiter, Sepp and Schmidhuber, J\". Simplifying neural nets by discovering flat minima , year =

work page
[62]

International Conference on Learning Representations , year=

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , author=. International Conference on Learning Representations , year=

work page
[63]

Proceedings of the 34th International Conference on Machine Learning , pages =

Sharp Minima Can Generalize For Deep Nets , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =

work page 2017
[64]

Journal of Machine Learning Research , year =

John Duchi and Elad Hazan and Yoram Singer , title =. Journal of Machine Learning Research , year =

work page
[65]

Proceedings of Thirty Third Conference on Learning Theory , pages =

Complexity Guarantees for Polyak Steps with Momentum , author =. Proceedings of Thirty Third Conference on Learning Theory , pages =. 2020 , editor =

work page 2020
[66]

2024 , eprint=

Exact Convergence rate of the subgradient method by using Polyak step size , author=. 2024 , eprint=

work page 2024
[67]

2021 , eprint=

Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence , author=. 2021 , eprint=

work page 2021
[68]

2024 , eprint=

Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution , author=. 2024 , eprint=

work page 2024
[69]

2024 , eprint=

A Universal Class of Sharpness-Aware Minimization Algorithms , author=. 2024 , eprint=

work page 2024
[70]

2024 , eprint=

On the Duality Between Sharpness-Aware Minimization and Adversarial Training , author=. 2024 , eprint=

work page 2024
[71]

2023 , eprint=

Sharpness-Aware Minimization Alone can Improve Adversarial Robustness , author=. 2023 , eprint=

work page 2023
[72]

Remote Sensing , VOLUME =

Dong, Mingrong and Yang, Yixuan and Zeng, Kai and Wang, Qingwang and Shen, Tao , TITLE =. Remote Sensing , VOLUME =. 2024 , NUMBER =

work page 2024
[73]

2022 , eprint=

Sharpness-Aware Minimization with Dynamic Reweighting , author=. 2022 , eprint=

work page 2022
[74]

2024 , eprint=

Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term , author=. 2024 , eprint=

work page 2024