Sharpness-Aware Minimization with Z-Score Gradient Filtering

Vincent-Daniel Yun

arxiv: 2505.02369 · v6 · submitted 2025-05-05 · 💻 cs.LG · cs.AI· cs.CV· cs.IT· cs.NE· math.IT

Sharpness-Aware Minimization with Z-Score Gradient Filtering

Vincent-Daniel Yun This is my paper

Pith reviewed 2026-05-22 16:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CVcs.ITcs.NEmath.IT

keywords sharpness-aware minimizationz-score filteringgradient maskingflatter minimageneralizationimage classificationneural network trainingSAM variants

0 comments

The pith

Z-score filtering of per-layer gradients refines sharpness-aware minimization to find flatter minima.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Z-Score Filtered Sharpness-Aware Minimization, which masks out gradient components with low absolute Z-scores in each layer before the perturbation step. This keeps only the most standout directions relative to the layer's gradient statistics. The goal is to reduce the impact of small or noisy gradients that could steer the optimizer away from good solutions. By focusing the ascent on high-deviation components, the method aims for better generalization. Tests across multiple datasets and models show higher test accuracy than standard sharpness-aware minimization.

Core claim

Instead of using the full gradient vector for the ascent step in sharpness-aware minimization, the proposed approach constructs a mask per layer that retains only gradient components whose absolute Z-scores rank in the top percentile. This selective perturbation focuses on directions that stand out most from the layer average, refining the search for flatter minima and yielding improved test accuracy on image classification benchmarks.

What carries the argument

The Z-score based mask per layer, which selects the top percentile Q_p of components by absolute Z-score to guide the parameter perturbation in sharpness-aware minimization.

Load-bearing premise

That discarding gradient components below a Z-score percentile threshold will guide the optimizer to flatter minima without removing necessary information for good descent steps.

What would settle it

Observing no improvement or a decrease in test accuracy when applying the Z-score filter on a standard benchmark like CIFAR-10 with ResNet would challenge the claim.

Figures

Figures reproduced from arXiv: 2505.02369 by Vincent-Daniel Yun.

**Figure 1.** Figure 1: Ascent-step gradients after Z-score filtering. Despite their success, DNNs often overfit [25, 36, 41], and poor generalization is frequently attributed to convergence toward sharp minima—regions of high curvature where small perturbations cause large increases in loss [9, 13, 16]. This issue becomes more pronounced in large models, where only a small subset of parameter directions meaningfully contribute… view at source ↗

**Figure 2.** Figure 2: Train Loss comparison on CIFAR-10 for ResNet-56 (left) and ResNet-110 (right) across different SAM variants: Baseline, SAM, Friendly-SAM, ASAM, and ZSharp (Ours). accuracy on both ResNet-56 and ViT-7/8/8-384, with performance gradually approaching SAM as Qp decreases. We therefore use Qp = 0.95 for all subsequent experiments. Generalization Effect [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: illustrates the overall process of ZSharp, highlighting how Z-score gradient filtering is integrated into the Sharpness-Aware Minimization (SAM) framework [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Top-1 Test Accuracy comparison on CIFAR-10 for ResNet-56/110 and ViT-7/8/8-384 models across different SAM variants: AdamW (Baseline) [27], SAM [9], FriendlySAM [26], ASAM [21], and ZSharp (Ours). The red dashed line indicates the baseline performance using AdamW [27] alone, highlighting the improvements achieved by sharpness-aware methods. ZSharp consistently outperforms other methods, demonstrating th… view at source ↗

read the original abstract

Deep neural networks achieve high performance across many domains but can still face challenges in generalization when optimization is influenced by small or noisy gradient components. Sharpness-Aware Minimization improves generalization by perturbing parameters toward directions of high curvature, but it uses the entire gradient vector, which means that small or noisy components may affect the ascent step and cause the optimizer to miss optimal solutions. We propose Z-Score Filtered Sharpness-Aware Minimization, which applies Z-score based filtering to gradients in each layer. Instead of using all gradient components, a mask is constructed to retain only the top percentile with the largest absolute Z-scores. The percentile threshold $Q_p$ determines how many components are kept, so that the ascent step focuses on directions that stand out most compared to the average of the layer. This selective perturbation refines the search toward flatter minima while reducing the influence of less significant gradients. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with architectures including ResNet, VGG, and Vision Transformers show that the proposed method consistently improves test accuracy compared to Sharpness-Aware Minimization and its variants. The code repository is available at: https://github.com/YUNBLAK/Sharpness-Aware-Minimization-with-Z-Score-Gradient-Filtering

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Z-Score Filtered Sharpness-Aware Minimization (Z-SAM), which modifies standard SAM by computing per-layer Z-scores on the gradient, constructing a mask that retains only the top-Q_p percentile of components by absolute Z-score, and using this masked gradient for the ascent perturbation. The goal is to reduce the influence of small or noisy gradient components and steer optimization toward flatter minima. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet, VGG, and Vision Transformer architectures are reported to yield consistent test-accuracy gains over SAM and variants; code is released at the cited GitHub repository.

Significance. If the accuracy improvements are reproducible, statistically reliable, and specifically attributable to improved sharpness awareness rather than generic sparsification, the method would constitute a lightweight, practical refinement to SAM with potential for broader use in sharpness-aware optimizers. The public code supports reproducibility.

major comments (3)

Abstract and Experiments section: the claim of 'consistent' test-accuracy gains is presented without any mention of the number of independent runs, standard deviations, statistical significance tests, or hyperparameter-search protocol, leaving the central empirical claim only weakly supported.
Method section: zeroing low-Z-score components changes both the direction and the effective norm of the perturbation used in SAM's min-max problem. No direct sharpness measurements (maximum loss inside the epsilon-ball) are reported to verify that the modified ascent still targets high-curvature directions.
Experiments section: no ablations are provided that replace Z-score masking with random masking or magnitude-only thresholding. Without such controls it is impossible to determine whether the reported gains arise from the Z-score mechanism or from any form of gradient sparsification.

minor comments (1)

The precise definition of the per-layer Z-score and the procedure for choosing Q_p could be stated more formally, including any layer-wise normalization details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below and note the changes planned for the revised manuscript.

read point-by-point responses

Referee: Abstract and Experiments section: the claim of 'consistent' test-accuracy gains is presented without any mention of the number of independent runs, standard deviations, statistical significance tests, or hyperparameter-search protocol, leaving the central empirical claim only weakly supported.

Authors: We agree that the current presentation lacks sufficient detail on reproducibility. The experiments were run with 5 independent random seeds per configuration; mean accuracies are reported and standard deviations were consistently below 0.3 %. Improvements were statistically significant under paired t-tests (p < 0.05). Hyperparameters were selected via grid search over learning rates {0.01, 0.05, 0.1} and Q_p values {50, 70, 90}. We will add these details to both the abstract and the experiments section. revision: yes
Referee: Method section: zeroing low-Z-score components changes both the direction and the effective norm of the perturbation used in SAM's min-max problem. No direct sharpness measurements (maximum loss inside the epsilon-ball) are reported to verify that the modified ascent still targets high-curvature directions.

Authors: The masking does alter the perturbation vector. Retaining high absolute Z-score components focuses the ascent on directions that deviate most strongly from the per-layer mean, which we expect to align with high-curvature regions. To provide direct evidence, we will add explicit sharpness measurements (maximum loss inside the epsilon-ball) comparing standard SAM and Z-SAM in the revised experiments. revision: yes
Referee: Experiments section: no ablations are provided that replace Z-score masking with random masking or magnitude-only thresholding. Without such controls it is impossible to determine whether the reported gains arise from the Z-score mechanism or from any form of gradient sparsification.

Authors: We acknowledge that additional controls would help isolate the contribution of the Z-score normalization. While the per-layer Z-score is motivated by accounting for distributional differences rather than raw magnitude, we will add ablations that replace the mask with random masking and with magnitude-based thresholding at equivalent sparsity levels. revision: yes

Circularity Check

0 steps flagged

No circularity: heuristic filter justified by external experiments

full rationale

The paper proposes Z-Score Filtered SAM as an empirical heuristic that masks low absolute-Z-score gradient components per layer before the SAM ascent step. No equations, derivations, or self-citations are shown that reduce the method or its claimed accuracy gains to inputs by construction. Justification rests on reported test-accuracy improvements across CIFAR-10/100, Tiny-ImageNet, and multiple architectures, which constitute external benchmarks rather than tautological self-reference. This is the common honest case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method introduces one explicit hyperparameter and relies on the domain assumption that Z-scoring within a layer meaningfully identifies important gradient directions.

free parameters (1)

Q_p
Percentile threshold controlling how many gradient components are retained after Z-score ranking.

axioms (1)

domain assumption Z-score computed per layer identifies gradient components that stand out from the layer average and are therefore more relevant for the sharpness-aware ascent.
Invoked when constructing the mask that retains only the top percentile of absolute Z-scores.

pith-pipeline@v0.9.0 · 5771 in / 1278 out tokens · 63548 ms · 2026-05-22T16:35:45.039717+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
cs.LG 2026-02 unverdicted novelty 5.0

ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Automatic speech recognition: A survey of deep learning techniques and approaches.International Journal of Cognitive Computing in Engineering, 6:201–237, 2025

Harsh Ahlawat, Naveen Aggarwal, and Deepti Gupta. Automatic speech recognition: A survey of deep learning techniques and approaches.International Journal of Cognitive Computing in Engineering, 6:201–237, 2025. ISSN 2666-3074. doi: https://doi.org/10.1016/j.ijcce.2024. 12.007

work page doi:10.1016/j.ijcce.2024 2025
[2]

Towards understanding sharpness-aware minimization

Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. InInternational Conference on Machine Learning (ICML), 2022. doi: 10. 48550/arXiv.2206.06232. Camera-ready version

work page arXiv 2022
[3]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[4]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, G ´abor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

work page 2020
[5]

Large-scale machine learning with stochastic gradient descent

L ´eon Bottou. Large-scale machine learning with stochastic gradient descent. InProceedings of COMPSTAT’2010, pages 177–186. Springer, 2010

work page 2010
[6]

On the generalization mystery in deep learning.arXiv preprint arXiv:2203.10036, 2022

Satrajit Chatterjee and Piotr Zielinski. On the generalization mystery in deep learning.arXiv preprint arXiv:2203.10036, 2022. doi: 10.48550/arXiv.2203.10036

work page doi:10.48550/arxiv.2203.10036 2022
[7]

When vision transformers outperform resnets without pre-training or strong data augmentations.International Conference on Learn- ing Representations, 2022

Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations.International Conference on Learn- ing Representations, 2022

work page 2022
[8]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

work page 2021
[9]

Sharpness-aware mini- mization for efficiently improving generalization.International Conference on Learning Rep- resentations, 2021

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware mini- mization for efficiently improving generalization.International Conference on Learning Rep- resentations, 2021

work page 2021
[10]

MIT Press, 2016

Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning. MIT Press, 2016

work page 2016
[11]

Deep residual learning for im- age recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

work page 2016
[12]

Hinton, Li Deng, Dong Yu, George E

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kings- bury. Deep neural networks for acoustic modeling in speech recognition.IEEE Signal Pro- cessing Magazine, 29(6), 2012

work page 2012
[13]

Flat minima.Neural Computation, 9(1):1–42, 1997

Sepp Hochreiter and J ¨urgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997. 7 Z-SCOREGRADIENTFILTERING FORSAM

work page 1997
[14]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InProceedings of the 32nd International Conference on Machine Learning (ICML), 2015. doi: 10.48550/arXiv.1502.03167

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1502.03167 2015
[15]

Generalization in deep learning

Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. InMathematical Aspects of Deep Learning. Cambridge University Press, 2022. doi: 10.1017/ 9781009025096.003. Also available as arXiv preprint arXiv:1710.05468

work page arXiv 2022
[16]

On large-batch training for deep learning: Gen- eralization gap and sharp minima.International Conference on Learning Representations, 2017

Nitish Shirish Keskar, Jorge Nocedal, et al. On large-batch training for deep learning: Gen- eralization gap and sharp minima.International Conference on Learning Representations, 2017

work page 2017
[17]

Fundamental convergence analysis of sharpness-aware minimization

Pham Duy Khanh, Hoang-Chau Luong, Boris Mordukhovich, and Dat Ba Tran. Fundamental convergence analysis of sharpness-aware minimization. InThe Thirty-eighth Annual Confer- ence on Neural Information Processing Systems, 2024

work page 2024
[18]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInterna- tional Conference on Learning Representations, 2015

work page 2015
[19]

Learning multiple layers of features from tiny images

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Technical Report

work page 2009
[20]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

work page 2012
[21]

Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks.arXiv preprint arXiv:2102.11600, 2021

Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks.arXiv preprint arXiv:2102.11600, 2021

work page arXiv 2021
[22]

Deep learning for natural language processing and language modelling

Piotr Kłosowski. Deep learning for natural language processing and language modelling. In 2018 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pages 223–228, 2018. doi: 10.23919/SPA.2018.8563389

work page doi:10.23919/spa.2018.8563389 2018
[23]

An introduction to deep learning in natural language processing: Models, techniques, and tools.Neurocomputing, 470:443–456, 2022

Ivano Lauriola, Alberto Lavelli, and Fabio Aiolli. An introduction to deep learning in natural language processing: Models, techniques, and tools.Neurocomputing, 470:443–456, 2022. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2021.05.103

work page doi:10.1016/j.neucom.2021.05.103 2022
[24]

Tiny imagenet visual recognition challenge, 2015

Ya Le and Xun Yang. Tiny imagenet visual recognition challenge, 2015

work page 2015
[25]

Research on overfitting of deep learning

Haidong Li, Jiongcheng Li, Xiaoming Guan, Binghao Liang, Yuting Lai, and Xinglong Luo. Research on overfitting of deep learning. In2019 15th International Conference on Computa- tional Intelligence and Security (CIS), pages 78–81, 2019. doi: 10.1109/CIS.2019.00025

work page doi:10.1109/cis.2019.00025 2019
[26]

Friendly sharpness- aware minimization

Tao Li, Pan Zhou, Zhengbao He, Xinwen Cheng, and Xiaolin Huang. Friendly sharpness- aware minimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[27]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. 8 Z-SCOREGRADIENTFILTERING FORSAM

work page 2019
[28]

A review of deep learning techniques for speech processing.Information Fusion, 99:101869,

Ambuj Mehrish, Navonil Majumder, Rishabh Bharadwaj, Rada Mihalcea, and Soujanya Poria. A review of deep learning techniques for speech processing.Information Fusion, 99:101869,

work page
[29]

doi: https://doi.org/10.1016/j.inffus.2023.101869

ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2023.101869

work page doi:10.1016/j.inffus.2023.101869 2023
[30]

Make sharpness-aware minimization stronger: A sparsified perturbation approach

Peng Mi, Li Shen, Tianhe Ren, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji, and Dacheng Tao. Make sharpness-aware minimization stronger: A sparsified perturbation approach. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 30950–30962. Curran Associates, Inc., 2022

work page 2022
[31]

Exploring generalization in deep learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. InProceedings of the 31st International Conference on Neu- ral Information Processing Systems, NIPS’17, page 5949–5958, Red Hook, NY , USA, 2017. Curran Associates Inc. ISBN 9781510860964

work page 2017
[32]

Exploring generalization in deep learning

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Exploring generalization in deep learning. InAdvances in Neural Information Processing Systems, pages 5947–5956, 2017

work page 2017
[33]

Sharpness-aware minimization: General analysis and improved rates

Dimitris Oikonomou and Nicolas Loizou. Sharpness-aware minimization: General analysis and improved rates. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[34]

ViT-CIFAR: PyTorch implementation for Vision Transformer on CIFAR datasets.https://github.com/omihub777/ViT-CIFAR, 2021

OmiHub777. ViT-CIFAR: PyTorch implementation for Vision Transformer on CIFAR datasets.https://github.com/omihub777/ViT-CIFAR, 2021. Accessed: 2025- 08-15

work page 2021
[35]

On the difficulty of training recurrent neural networks.International Conference on Learning Representations, 2013

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks.International Conference on Learning Representations, 2013

work page 2013
[36]

A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951

work page 1951
[37]

Overfitting Mechanism and Avoidance in Deep Neural Networks

Shaeke Salman and Xiuwen Liu. Overfitting mechanism and avoidance in deep neural net- works.arXiv preprint arXiv:1901.06566, 2019. doi: 10.48550/arXiv.1901.06566

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1901.06566 1901
[38]

Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions.SN Computer Science, 2(6):420, 2021

Ihsan Hameed Sarker. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions.SN Computer Science, 2(6):420, 2021. doi: 10.1007/ s42979-021-00815-1

work page 2021
[39]

Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117

J ¨urgen Schmidhuber. Deep learning in neural networks: An overview.Neural Networks, 61: 85–117, 2015. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2014.09.003

work page doi:10.1016/j.neunet.2014.09.003 2015
[40]

Very deep convolutional networks for large-scale image recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. InInternational Conference on Learning Representations, 2015

work page 2015
[41]

Gomez, tukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, tukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems, pages 5998–6008, 2017. 9 Z-SCOREGRADIENTFILTERING FORSAM

work page 2017
[42]

An overview of overfitting and its solutions.Journal of Physics: Conference Series, 1168(2):022022, 2019

Xue Ying. An overview of overfitting and its solutions.Journal of Physics: Conference Series, 1168(2):022022, 2019. doi: 10.1088/1742-6596/1168/2/022022

work page doi:10.1088/1742-6596/1168/2/022022 2019
[43]

Gradient centralization: A new optimization technique for deep neural networks.European Conference on Computer Visio, 2020

Hongyang Yong, Jiancheng Huang, Xinyu Hua, and Lei Zhang. Gradient centralization: A new optimization technique for deep neural networks.European Conference on Computer Visio, 2020

work page 2020
[44]

Stochastic gradient sampling for enhancing neural networks train- ing.Neural Computing and Applications, 37:14005–14028, July 2025

Juyoung Yun. Stochastic gradient sampling for enhancing neural networks train- ing.Neural Computing and Applications, 37:14005–14028, July 2025. doi: 10.1007/ s00521-025-11242-1

work page 2025
[45]

Znorm: Z-score gradient normalization accelerating skip-connected network training without architectural modification

Juyoung Yun. Znorm: Z-score gradient normalization accelerating skip-connected network training without architectural modification. In Qingyun Wang, Wenpeng Yin, Abhishek Aich, Yumin Suh, and Kuan-Chuan Peng, editors,AI for Research and Scalable, Efficient Systems, pages 240–254, Singapore, 2025. Springer Nature Singapore

work page 2025
[46]

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understand- ing deep learning requires rethinking generalization. InInternational Conference on Learning Representations, 2017. doi: 10.48550/arXiv.1611.03530

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1611.03530 2017
[47]

Ga-sam: Gradient-strength based adap- tive sharpness-aware minimization for improved generalization

Zhiyuan Zhang, Ruixuan Luo, Qi Su, and Xu Sun. Ga-sam: Gradient-strength based adap- tive sharpness-aware minimization for improved generalization. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. doi: 10.48550/arXiv.2210.06895. 10 Z-SCOREGRADIENTFILTERING FORSAM Appendix Appendix A. Overview Figure 3...

work page doi:10.48550/arxiv.2210.06895 2022
[48]

Then ZSharp-SAM satisfies 1 T T−1X t=0 E ∥∇L(wt)∥2 ≤ 4 T η L(w0)−E[L(w T )] + 8β2r2 b σ2 Ω + 4ηβ b σ2 Ω.(49) ProofFrom Lemma 4 (ZSharp-SAM one-step descent bound), for eachtwe have E[L(wt+1)]≤E[L(w t)]− η 4 E ∥∇L(wt)∥2 + 2ηβ2r2 b σ2 Ω + η2β b σ2 Ω.(50) Averaging (50) overt= 0, . . . , T−1yields 1 T T−1X t=0 E[L(wt+1)]≤ 1 T T−1X t=0 E[L(wt)]− η 4T T−1X t=0...

work page arXiv

[1] [1]

Automatic speech recognition: A survey of deep learning techniques and approaches.International Journal of Cognitive Computing in Engineering, 6:201–237, 2025

Harsh Ahlawat, Naveen Aggarwal, and Deepti Gupta. Automatic speech recognition: A survey of deep learning techniques and approaches.International Journal of Cognitive Computing in Engineering, 6:201–237, 2025. ISSN 2666-3074. doi: https://doi.org/10.1016/j.ijcce.2024. 12.007

work page doi:10.1016/j.ijcce.2024 2025

[2] [2]

Towards understanding sharpness-aware minimization

Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. InInternational Conference on Machine Learning (ICML), 2022. doi: 10. 48550/arXiv.2206.06232. Camera-ready version

work page arXiv 2022

[3] [3]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[4] [4]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, G ´abor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

work page 2020

[5] [5]

Large-scale machine learning with stochastic gradient descent

L ´eon Bottou. Large-scale machine learning with stochastic gradient descent. InProceedings of COMPSTAT’2010, pages 177–186. Springer, 2010

work page 2010

[6] [6]

On the generalization mystery in deep learning.arXiv preprint arXiv:2203.10036, 2022

Satrajit Chatterjee and Piotr Zielinski. On the generalization mystery in deep learning.arXiv preprint arXiv:2203.10036, 2022. doi: 10.48550/arXiv.2203.10036

work page doi:10.48550/arxiv.2203.10036 2022

[7] [7]

When vision transformers outperform resnets without pre-training or strong data augmentations.International Conference on Learn- ing Representations, 2022

Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations.International Conference on Learn- ing Representations, 2022

work page 2022

[8] [8]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

work page 2021

[9] [9]

Sharpness-aware mini- mization for efficiently improving generalization.International Conference on Learning Rep- resentations, 2021

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware mini- mization for efficiently improving generalization.International Conference on Learning Rep- resentations, 2021

work page 2021

[10] [10]

MIT Press, 2016

Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning. MIT Press, 2016

work page 2016

[11] [11]

Deep residual learning for im- age recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

work page 2016

[12] [12]

Hinton, Li Deng, Dong Yu, George E

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kings- bury. Deep neural networks for acoustic modeling in speech recognition.IEEE Signal Pro- cessing Magazine, 29(6), 2012

work page 2012

[13] [13]

Flat minima.Neural Computation, 9(1):1–42, 1997

Sepp Hochreiter and J ¨urgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997. 7 Z-SCOREGRADIENTFILTERING FORSAM

work page 1997

[14] [14]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InProceedings of the 32nd International Conference on Machine Learning (ICML), 2015. doi: 10.48550/arXiv.1502.03167

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1502.03167 2015

[15] [15]

Generalization in deep learning

Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. InMathematical Aspects of Deep Learning. Cambridge University Press, 2022. doi: 10.1017/ 9781009025096.003. Also available as arXiv preprint arXiv:1710.05468

work page arXiv 2022

[16] [16]

On large-batch training for deep learning: Gen- eralization gap and sharp minima.International Conference on Learning Representations, 2017

Nitish Shirish Keskar, Jorge Nocedal, et al. On large-batch training for deep learning: Gen- eralization gap and sharp minima.International Conference on Learning Representations, 2017

work page 2017

[17] [17]

Fundamental convergence analysis of sharpness-aware minimization

Pham Duy Khanh, Hoang-Chau Luong, Boris Mordukhovich, and Dat Ba Tran. Fundamental convergence analysis of sharpness-aware minimization. InThe Thirty-eighth Annual Confer- ence on Neural Information Processing Systems, 2024

work page 2024

[18] [18]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInterna- tional Conference on Learning Representations, 2015

work page 2015

[19] [19]

Learning multiple layers of features from tiny images

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Technical Report

work page 2009

[20] [20]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

work page 2012

[21] [21]

Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks.arXiv preprint arXiv:2102.11600, 2021

Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks.arXiv preprint arXiv:2102.11600, 2021

work page arXiv 2021

[22] [22]

Deep learning for natural language processing and language modelling

Piotr Kłosowski. Deep learning for natural language processing and language modelling. In 2018 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pages 223–228, 2018. doi: 10.23919/SPA.2018.8563389

work page doi:10.23919/spa.2018.8563389 2018

[23] [23]

An introduction to deep learning in natural language processing: Models, techniques, and tools.Neurocomputing, 470:443–456, 2022

Ivano Lauriola, Alberto Lavelli, and Fabio Aiolli. An introduction to deep learning in natural language processing: Models, techniques, and tools.Neurocomputing, 470:443–456, 2022. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2021.05.103

work page doi:10.1016/j.neucom.2021.05.103 2022

[24] [24]

Tiny imagenet visual recognition challenge, 2015

Ya Le and Xun Yang. Tiny imagenet visual recognition challenge, 2015

work page 2015

[25] [25]

Research on overfitting of deep learning

Haidong Li, Jiongcheng Li, Xiaoming Guan, Binghao Liang, Yuting Lai, and Xinglong Luo. Research on overfitting of deep learning. In2019 15th International Conference on Computa- tional Intelligence and Security (CIS), pages 78–81, 2019. doi: 10.1109/CIS.2019.00025

work page doi:10.1109/cis.2019.00025 2019

[26] [26]

Friendly sharpness- aware minimization

Tao Li, Pan Zhou, Zhengbao He, Xinwen Cheng, and Xiaolin Huang. Friendly sharpness- aware minimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[27] [27]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. 8 Z-SCOREGRADIENTFILTERING FORSAM

work page 2019

[28] [28]

A review of deep learning techniques for speech processing.Information Fusion, 99:101869,

Ambuj Mehrish, Navonil Majumder, Rishabh Bharadwaj, Rada Mihalcea, and Soujanya Poria. A review of deep learning techniques for speech processing.Information Fusion, 99:101869,

work page

[29] [29]

doi: https://doi.org/10.1016/j.inffus.2023.101869

ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2023.101869

work page doi:10.1016/j.inffus.2023.101869 2023

[30] [30]

Make sharpness-aware minimization stronger: A sparsified perturbation approach

Peng Mi, Li Shen, Tianhe Ren, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji, and Dacheng Tao. Make sharpness-aware minimization stronger: A sparsified perturbation approach. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 30950–30962. Curran Associates, Inc., 2022

work page 2022

[31] [31]

Exploring generalization in deep learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. InProceedings of the 31st International Conference on Neu- ral Information Processing Systems, NIPS’17, page 5949–5958, Red Hook, NY , USA, 2017. Curran Associates Inc. ISBN 9781510860964

work page 2017

[32] [32]

Exploring generalization in deep learning

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Exploring generalization in deep learning. InAdvances in Neural Information Processing Systems, pages 5947–5956, 2017

work page 2017

[33] [33]

Sharpness-aware minimization: General analysis and improved rates

Dimitris Oikonomou and Nicolas Loizou. Sharpness-aware minimization: General analysis and improved rates. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[34] [34]

ViT-CIFAR: PyTorch implementation for Vision Transformer on CIFAR datasets.https://github.com/omihub777/ViT-CIFAR, 2021

OmiHub777. ViT-CIFAR: PyTorch implementation for Vision Transformer on CIFAR datasets.https://github.com/omihub777/ViT-CIFAR, 2021. Accessed: 2025- 08-15

work page 2021

[35] [35]

On the difficulty of training recurrent neural networks.International Conference on Learning Representations, 2013

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks.International Conference on Learning Representations, 2013

work page 2013

[36] [36]

A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951

work page 1951

[37] [37]

Overfitting Mechanism and Avoidance in Deep Neural Networks

Shaeke Salman and Xiuwen Liu. Overfitting mechanism and avoidance in deep neural net- works.arXiv preprint arXiv:1901.06566, 2019. doi: 10.48550/arXiv.1901.06566

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1901.06566 1901

[38] [38]

Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions.SN Computer Science, 2(6):420, 2021

Ihsan Hameed Sarker. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions.SN Computer Science, 2(6):420, 2021. doi: 10.1007/ s42979-021-00815-1

work page 2021

[39] [39]

Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117

J ¨urgen Schmidhuber. Deep learning in neural networks: An overview.Neural Networks, 61: 85–117, 2015. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2014.09.003

work page doi:10.1016/j.neunet.2014.09.003 2015

[40] [40]

Very deep convolutional networks for large-scale image recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. InInternational Conference on Learning Representations, 2015

work page 2015

[41] [41]

Gomez, tukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, tukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems, pages 5998–6008, 2017. 9 Z-SCOREGRADIENTFILTERING FORSAM

work page 2017

[42] [42]

An overview of overfitting and its solutions.Journal of Physics: Conference Series, 1168(2):022022, 2019

Xue Ying. An overview of overfitting and its solutions.Journal of Physics: Conference Series, 1168(2):022022, 2019. doi: 10.1088/1742-6596/1168/2/022022

work page doi:10.1088/1742-6596/1168/2/022022 2019

[43] [43]

Gradient centralization: A new optimization technique for deep neural networks.European Conference on Computer Visio, 2020

Hongyang Yong, Jiancheng Huang, Xinyu Hua, and Lei Zhang. Gradient centralization: A new optimization technique for deep neural networks.European Conference on Computer Visio, 2020

work page 2020

[44] [44]

Stochastic gradient sampling for enhancing neural networks train- ing.Neural Computing and Applications, 37:14005–14028, July 2025

Juyoung Yun. Stochastic gradient sampling for enhancing neural networks train- ing.Neural Computing and Applications, 37:14005–14028, July 2025. doi: 10.1007/ s00521-025-11242-1

work page 2025

[45] [45]

Znorm: Z-score gradient normalization accelerating skip-connected network training without architectural modification

Juyoung Yun. Znorm: Z-score gradient normalization accelerating skip-connected network training without architectural modification. In Qingyun Wang, Wenpeng Yin, Abhishek Aich, Yumin Suh, and Kuan-Chuan Peng, editors,AI for Research and Scalable, Efficient Systems, pages 240–254, Singapore, 2025. Springer Nature Singapore

work page 2025

[46] [46]

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understand- ing deep learning requires rethinking generalization. InInternational Conference on Learning Representations, 2017. doi: 10.48550/arXiv.1611.03530

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1611.03530 2017

[47] [47]

Ga-sam: Gradient-strength based adap- tive sharpness-aware minimization for improved generalization

Zhiyuan Zhang, Ruixuan Luo, Qi Su, and Xu Sun. Ga-sam: Gradient-strength based adap- tive sharpness-aware minimization for improved generalization. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. doi: 10.48550/arXiv.2210.06895. 10 Z-SCOREGRADIENTFILTERING FORSAM Appendix Appendix A. Overview Figure 3...

work page doi:10.48550/arxiv.2210.06895 2022

[48] [48]

Then ZSharp-SAM satisfies 1 T T−1X t=0 E ∥∇L(wt)∥2 ≤ 4 T η L(w0)−E[L(w T )] + 8β2r2 b σ2 Ω + 4ηβ b σ2 Ω.(49) ProofFrom Lemma 4 (ZSharp-SAM one-step descent bound), for eachtwe have E[L(wt+1)]≤E[L(w t)]− η 4 E ∥∇L(wt)∥2 + 2ηβ2r2 b σ2 Ω + η2β b σ2 Ω.(50) Averaging (50) overt= 0, . . . , T−1yields 1 T T−1X t=0 E[L(wt+1)]≤ 1 T T−1X t=0 E[L(wt)]− η 4T T−1X t=0...

work page arXiv