pith. sign in

arxiv: 2606.00738 · v1 · pith:ZHGZQT4Gnew · submitted 2026-05-30 · 💻 cs.LG · cs.AI· cs.CV

SORA: Free Second-Order Attacks in Fast Adversarial Training

Pith reviewed 2026-06-28 19:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV
keywords adversarial trainingcatastrophic overfittingepsilon overfittingadversarial robustnessfast adversarial trainingperturbation alignmentadaptive step size
0
0 comments X

The pith

SORA adapts perturbation step sizes during fast adversarial training using a gradient alignment metric to prevent catastrophic overfitting and improve both robustness and clean accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes epsilon overfitting as the tendency of fixed perturbation magnitudes and directions to trigger catastrophic overfitting in single-step adversarial training, where multi-step robustness suddenly collapses. It introduces PertAlign, a cheap metric based on gradient alignment across successive attack stages, that predicts when this collapse will occur from the geometry of the loss surface. SORA then uses the metric to adjust step sizes on the fly without extra hyperparameters or post-training fixes. The resulting method matches or exceeds prior robustness numbers while raising clean accuracy and runs with one fixed hyperparameter set on multiple datasets and architectures.

Core claim

SORA is an adaptive step-size adversarial training procedure that dynamically changes perturbation magnitudes according to the PertAlign score. PertAlign measures how aligned the gradients remain across attack iterations and thereby signals the onset of catastrophic overfitting. By responding to this signal the method eliminates the collapse, reaches state-of-the-art robust and clean accuracy, and transfers across data sets and models with unchanged hyperparameters.

What carries the argument

PertAlign, a metric that quantifies gradient alignment across attack stages to predict catastrophic overfitting onset and drives the adaptive step-size rule inside SORA.

If this is right

  • SORA removes the need for dataset-specific hyperparameter search in fast adversarial training.
  • The same fixed hyperparameter set yields both higher clean accuracy and competitive robustness on multiple benchmarks.
  • Perturbation variability introduced by the adaptive rule improves robust generalization beyond what fixed-magnitude attacks achieve.
  • The method remains computationally negligible because PertAlign adds almost no overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If gradient alignment tracks a general property of adversarial loss surfaces, the same signal could guide step-size choice in multi-step or certified training pipelines.
  • The success with fixed hyperparameters suggests that earlier methods may have required per-dataset tuning partly because they lacked an online indicator of overfitting risk.
  • Monitoring loss-surface geometry through cheap alignment statistics may be useful in other robust-optimization settings where sudden performance drops occur.

Load-bearing premise

PertAlign derived from gradient alignment across attack stages reliably forecasts the start of catastrophic overfitting in time for the adaptation rule to act without dataset-specific retuning.

What would settle it

Run SORA on a previously unseen architecture or data set and check whether multi-step adversarial accuracy still drops sharply after the PertAlign threshold is crossed.

Figures

Figures reproduced from arXiv: 2606.00738 by Farzan Rahmani, Mazdak Teymourian, Mohammad Hossein Rohban, Ramtin Moslemi.

Figure 1
Figure 1. Figure 1: Evolution of the loss landscape geometry around a sample during FGSM training. Left: Early in training, before CO, the model is robust, with FGSM and PGD accuracies closely matched. The loss surface is approximately semilinear. Middle: A few batch updates before CO, the decision boundary begins to wrap around the original and adversarial examples, forming a nonlinear region that is not yet misclassified. R… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of a model exhibiting CO (a) with a robust model (b), evaluated using FGSM and PGD-10. 3.1. Evolution of Loss Landscape Geometry A striking property of CO is that it does not degrade clean accuracy, yet it often causes a substantial increase in FGSM accuracy, sometimes even surpassing clean accuracy on the test set. This counter-intuitive effect prompted us to examine FGSM accuracy across a rang… view at source ↗
Figure 3
Figure 3. Figure 3: Left: Tracking PertAlign during FGSM AT and SORA AT for CIFAR-10 dataset, PertAlign collapses on the occurrence of CO. Right: At batch 3775 of FGSM AT, PertAlign and GradAlign begin to drop, forecasting CO, while FGSM and PGD accuracies visibly diverge only around batch 3825. Other metrics, AAE Share of the Batch, Scaled KL Divergence from TRADES, and Scaled ELLE nonlinearity measure, react later in respon… view at source ↗
Figure 4
Figure 4. Figure 4: When the decision boundary becomes distorted, single-step attacks may produce AAEs, whereas multi-step attacks such as PGD can still reliably generate NAEs. By adapting the attack step-size to the local linearity of the loss surface, SORA can also produce NAEs in such scenarios. Training on AAEs tends to exacerbate distortion in the loss surface, while training on NAEs can guide the model toward recovery. … view at source ↗
Figure 5
Figure 5. Figure 5: Theoretically derived and self-adaptive α ∗ during SORA AT with ϵ = 8/255 on CIFAR-10 (a) and PATHMNIST (b). with a temporal gap. Similar to the approach of Shafahi et al. (2019a), which generates adversarial examples with a temporal gap, we use the optimal step-size with a lag, enabling our method to exploit second-order information essentially for free. Since the weights change only slightly after each u… view at source ↗
Figure 6
Figure 6. Figure 6: Evaluation of different methods across datasets and architectures. Left: SORA attains the highest robust accuracy among single-step methods. Right: Corresponding clean accuracy. 10 15 20 25 Training Time (Minutes) 0 10 20 30 40 Accuracy (%) NuAT ELLE ATAS MultiGrad ZeroGrad AAER NFGSM GradAlign FGSM-RS Free AT SORA (a) Time 0.5 1.0 1.5 2.0 Maximum Allocated Memory (GB) 0 10 20 30 40 Accuracy (%) NuAT ELLE … view at source ↗
Figure 7
Figure 7. Figure 7: Training time vs. memory usage on PATHMNIST with PreActResNet-18, trained for 30 epochs, measured on an NVIDIA GeForce RTX 4090 GPU. The ⋆ marks SORA. The vertical axis in both figures represents PGD-10 accuracy. 2022), AAER (Lin et al., 2024b), and ELLE (Rocamora et al., 2024). We also evaluate multi-step AT methods, in￾cluding PGD-2 and PGD-10, as well as TRADES (Zhang et al., 2019b), which serve as uppe… view at source ↗
Figure 8
Figure 8. Figure 8: Class distributions across datasets. CIFAR-10, CIFAR-100, TINYIMAGENET, and IMAGENET-100 exhibit balanced class distributions, whereas PATHMNIST and TISSUEMNIST are imbalanced, with TISSUEMNIST showing the most pronounced imbalance. Dataset Selection Criteria In line with the previous work done on FAT, we included the CIFAR-10 and CIFAR-100 datasets which are the two most commonly used datasets in the comm… view at source ↗
Figure 9
Figure 9. Figure 9: Dataset samples and their corresponding FGSM adversarial examples for a model adversarially trained with the SORA method. Each row corresponds to a dataset (CIFAR-10, CIFAR-100, TINYIMAGENET, IMAGENET-100, PATHMNIST, and TISSUEMNIST, respectively). The “Clean” columns show the original images, the “Perturbed” columns show the FGSM adversarial examples, and the “Perturbation” columns visualize the added per… view at source ↗
Figure 10
Figure 10. Figure 10: Overlay of FGSM accuracies showing the effect of different training ϵ values on the EO peak location and sharpness. Varying the training value of ϵ influences the peak of FGSM accuracy following CO [PITH_FULL_IMAGE:figures/full_fig_p036_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Examples of EO occurrence across datasets with models exhibiting CO. EO is also evident across multiple datasets, as shown in [PITH_FULL_IMAGE:figures/full_fig_p037_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Tracking SORA’s α ∗ across datasets and models. 38 [PITH_FULL_IMAGE:figures/full_fig_p038_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Loss landscape visualization for different datasets and training methods, trained using PreActResNet-18. As shown in Figures 13(a), 13(d), and 13(g), the distortion of the loss around the origin (α = 0 and β = 0), which arises naturally from standard training, does not exhibit a consistent or distinctive pattern. In contrast, the distortion caused by CO shows clear similarities across different datasets (… view at source ↗
Figure 14
Figure 14. Figure 14: Tracking different metrics during FGSM and SORA AT. During robust training where CO does not occur ( [PITH_FULL_IMAGE:figures/full_fig_p043_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Tracking PertAlign across datasets, models, and FAT methods. 45 [PITH_FULL_IMAGE:figures/full_fig_p045_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Training time vs. memory usage on CIFAR-10 with PreActResNet-18, trained for 30 epochs, measured on an NVIDIA GeForce RTX 4090 GPU. The ⋆ marks SORA. 46 [PITH_FULL_IMAGE:figures/full_fig_p046_16.png] view at source ↗
read the original abstract

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to address catastrophic overfitting (CO) in fast adversarial training by formalizing Epsilon Overfitting (EO) as arising from fixed perturbation magnitudes/directions, introducing PertAlign as a theoretically grounded metric that predicts CO onset via gradient alignment across attack stages, and proposing SORA as an adaptive step-size method that dynamically adjusts perturbations based on loss surface geometry. It asserts that SORA prevents CO, achieves SOTA robustness and clean accuracy, and generalizes across datasets/architectures with one fixed hyperparameter set, backed by extensive experiments and public code.

Significance. If the results hold, the work would be significant for practical adversarial robustness, as a method that avoids CO and dataset-specific tuning while maintaining efficiency would improve applicability of fast AT. The public code availability is a clear strength for reproducibility.

major comments (3)
  1. [Abstract] Abstract: the central claim that PertAlign is 'theoretically grounded' and enables a 'single fixed set of hyperparameters' without post-hoc tuning rests on an unshown mapping from the alignment statistic to the step-size rule; no equation or derivation is provided to establish independence from the training loss surface.
  2. [Method (PertAlign)] Method section on PertAlign: the metric is described as derived from gradient alignment on the same loss surface used for training, which creates a potential circularity risk for predicting CO onset; an explicit test showing that the statistic tracks multi-step robustness collapse independently of the fitted adaptive behavior is required to support the no-tuning claim.
  3. [Experiments] Experiments: the generalization and SOTA claims are load-bearing, yet the absence of error bars, multiple random seeds, or statistical tests leaves open whether observed improvements over baselines are robust or could be explained by variance.
minor comments (2)
  1. The abstract states 'negligible cost' for PertAlign; the experiments should report wall-clock overhead relative to standard fast AT to substantiate this.
  2. Tables comparing to prior methods should explicitly state whether all baselines use the same hyperparameter search budget as SORA.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that PertAlign is 'theoretically grounded' and enables a 'single fixed set of hyperparameters' without post-hoc tuning rests on an unshown mapping from the alignment statistic to the step-size rule; no equation or derivation is provided to establish independence from the training loss surface.

    Authors: We agree that the abstract's reference to theoretical grounding would be strengthened by an explicit derivation of the mapping from PertAlign to the step-size rule and its independence from the training loss surface. In the revised manuscript we will add this derivation in the Method section, including the relevant equation linking the alignment statistic to the adaptive rule. revision: yes

  2. Referee: [Method (PertAlign)] Method section on PertAlign: the metric is described as derived from gradient alignment on the same loss surface used for training, which creates a potential circularity risk for predicting CO onset; an explicit test showing that the statistic tracks multi-step robustness collapse independently of the fitted adaptive behavior is required to support the no-tuning claim.

    Authors: We acknowledge the circularity concern. To support the no-tuning claim we will add an explicit experiment in the revised paper that evaluates PertAlign on fixed-step-size training runs (independent of SORA) and shows that the metric still tracks the onset of multi-step robustness collapse. revision: yes

  3. Referee: [Experiments] Experiments: the generalization and SOTA claims are load-bearing, yet the absence of error bars, multiple random seeds, or statistical tests leaves open whether observed improvements over baselines are robust or could be explained by variance.

    Authors: We agree that the absence of error bars and multi-seed statistics weakens the load-bearing claims. In the revision we will report all main results over at least three random seeds with error bars and include statistical significance tests for the key comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation self-contained against external benchmarks

full rationale

The provided abstract and description introduce EO as a formalization and PertAlign as a new metric based on gradient alignment across attack stages, with SORA as the resulting adaptive method. No equations, self-citations, or fitted parameters are shown that reduce any prediction or central claim to its own inputs by construction. The claims rest on the metric's independent predictive power for CO onset and generalization with fixed hyperparameters, which the text presents as externally verifiable through experiments rather than tautological. This is the most common honest finding for papers without explicit self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the adaptive adjustment is described at high level without explicit fitting details or background assumptions listed.

pith-pipeline@v0.9.1-grok · 5752 in / 1130 out tokens · 23997 ms · 2026-06-28T19:02:49.886933+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 50 canonical work pages · 15 internal anchors

  1. [1]

    2014 , eprint=

    Intriguing properties of neural networks , author=. 2014 , eprint=

  2. [2]

    Explaining and Harnessing Adversarial Examples

    Ian J. Goodfellow and Jonathon Shlens and Christian Szegedy , year=. 1412.6572 , archivePrefix=

  3. [3]

    Universal adversarial perturbations

    Seyed-Mohsen Moosavi-Dezfooli and Alhussein Fawzi and Omar Fawzi and Pascal Frossard , year=. 1610.08401 , archivePrefix=

  4. [4]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    Aleksander Madry and Aleksandar Makelov and Ludwig Schmidt and Dimitris Tsipras and Adrian Vladu , year=. 1706.06083 , archivePrefix=

  5. [5]

    1805.12152 , archivePrefix=

    Dimitris Tsipras and Shibani Santurkar and Logan Engstrom and Alexander Turner and Aleksander Madry , year=. 1805.12152 , archivePrefix=

  6. [6]

    Davis and Tom Goldstein , year=

    Ali Shafahi and Mahyar Najibi and Zheng Xu and John Dickerson and Larry S. Davis and Tom Goldstein , year=. 1811.11304 , archivePrefix=

  7. [7]

    1812.02637 , archivePrefix=

    Gavin Weiguang Ding and Yash Sharma and Kry Yik Chau Lui and Ruitong Huang , year=. 1812.02637 , archivePrefix=

  8. [8]

    Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

    Anish Athalye and Nicholas Carlini and David Wagner , year=. 1802.00420 , archivePrefix=

  9. [9]

    Theoretically Principled Trade-off between Robustness and Accuracy

    Hongyang Zhang and Yaodong Yu and Jiantao Jiao and Eric P. Xing and Laurent El Ghaoui and Michael I. Jordan , year=. 1901.08573 , archivePrefix=

  10. [10]

    Davis and Gavin Taylor and Tom Goldstein , year=

    Ali Shafahi and Mahyar Najibi and Amin Ghiasi and Zheng Xu and John Dickerson and Christoph Studer and Larry S. Davis and Gavin Taylor and Tom Goldstein , year=. 1904.12843 , archivePrefix=

  11. [11]

    1905.00877 , archivePrefix=

    Dinghuai Zhang and Tianyuan Zhang and Yiping Lu and Zhanxing Zhu and Bin Dong , year=. 1905.00877 , archivePrefix=

  12. [12]

    1912.11969 , archivePrefix=

    Haizhong Zheng and Ziqi Zhang and Juncheng Gu and Honglak Lee and Atul Prakash , year=. 1912.11969 , archivePrefix=

  13. [13]

    2003.01690 , archivePrefix=

    Francesco Croce and Matthias Hein , year=. 2003.01690 , archivePrefix=

  14. [14]

    2103.15670 , archivePrefix=

    Rulin Shao and Zhouxing Shi and Jinfeng Yi and Pin-Yu Chen and Cho-Jui Hsieh , year=. 2103.15670 , archivePrefix=

  15. [15]

    2020 , eprint=

    Fast is better than free: Revisiting adversarial training , author=. 2020 , eprint=

  16. [16]

    2002.10097 , archivePrefix=

    Leo Schwinn and René Raab and Björn Eskofier , year=. 2002.10097 , archivePrefix=

  17. [17]

    Vivek B. S. and R. Venkatesh Babu , year=. 2004.08628 , archivePrefix=

  18. [18]

    2006.03089 , archivePrefix=

    Bai Li and Shiqi Wang and Suman Jana and Lawrence Carin , year=. 2006.03089 , archivePrefix=

  19. [19]

    2007.02617 , archivePrefix=

    Maksym Andriushchenko and Nicolas Flammarion , year=. 2007.02617 , archivePrefix=

  20. [20]

    2010.01799 , archivePrefix=

    Hoki Kim and Woojin Lee and Jaewook Lee , year=. 2010.01799 , archivePrefix=

  21. [21]

    2021 , url=

    Gaurang Sriramanan and Sravanti Addepalli and Arya Baburaj and Venkatesh Babu Radhakrishnan , booktitle=. 2021 , url=

  22. [22]

    Venkatesh Babu , year=

    Gaurang Sriramanan and Sravanti Addepalli and Arya Baburaj and R. Venkatesh Babu , year=. 2011.14969 , archivePrefix=

  23. [23]

    2103.15476 , archivePrefix=

    Zeinab Golgooni and Mehrdad Saberi and Masih Eskandar and Mohammad Hossein Rohban , year=. 2103.15476 , archivePrefix=

  24. [24]

    2105.02942 , archivePrefix=

    Peilin Kang and Seyed-Mohsen Moosavi-Dezfooli , year=. 2105.02942 , archivePrefix=

  25. [25]

    2021 , eprint=

    Adaptive perturbation adversarial training: based on reinforcement learning , author=. 2021 , eprint=

  26. [26]

    2112.12376 , archivePrefix=

    Yihua Zhang and Guanhua Zhang and Prashant Khanduri and Mingyi Hong and Shiyu Chang and Sijia Liu , year=. 2112.12376 , archivePrefix=

  27. [27]

    Pau de Jorge and Adel Bibi and Riccardo Volpi and Amartya Sanyal and Philip H. S. Torr and Grégory Rogez and Puneet K. Dokania , year=. 2202.01181 , archivePrefix=

  28. [28]

    2206.02417 , archivePrefix=

    Zhichao Huang and Yanbo Fan and Chen Liu and Weizhong Zhang and Yong Zhang and Mathieu Salzmann and Sabine Süsstrunk and Jue Wang , year=. 2206.02417 , archivePrefix=

  29. [29]

    2207.08859 , archivePrefix=

    Xiaojun Jia and Yong Zhang and Xingxing Wei and Baoyuan Wu and Ke Ma and Jue Wang and Xiaochun Cao , year=. 2207.08859 , archivePrefix=

  30. [30]

    2207.10498 , archivePrefix=

    Boxi Wu and Jindong Gu and Zhifeng Li and Deng Cai and Xiaofei He and Wei Liu , year=. 2207.10498 , archivePrefix=

  31. [31]

    2304.00202 , archivePrefix=

    Xiaojun Jia and Yong Zhang and Xingxing Wei and Baoyuan Wu and Ke Ma and Jue Wang and Xiaochun Cao , year=. 2304.00202 , archivePrefix=

  32. [32]

    2310.08847 , archivePrefix=

    Runqi Lin and Chaojian Yu and Bo Han and Tongliang Liu , year=. 2310.08847 , archivePrefix=

  33. [33]

    2310.18975 , archivePrefix=

    Mahdi Salmani and Alireza Dehghanpour Farashah and Mohammad Azizmalayeri and Mahdi Amiri and Navid Eslami and Mohammad Taghi Manzuri and Mohammad Hossein Rohban , year=. 2310.18975 , archivePrefix=

  34. [34]

    Chrysos and Pablo M

    Elias Abad Rocamora and Fanghui Liu and Grigorios G. Chrysos and Pablo M. Olmos and Volkan Cevher , year=. 2401.11618 , archivePrefix=

  35. [35]

    2404.08154 , archivePrefix=

    Runqi Lin and Chaojian Yu and Tongliang Liu , year=. 2404.08154 , archivePrefix=

  36. [36]

    2405.16262 , archivePrefix=

    Runqi Lin and Chaojian Yu and Bo Han and Hang Su and Tongliang Liu , year=. 2405.16262 , archivePrefix=

  37. [37]

    2407.12443 , archivePrefix=

    Zhaoxin Wang and Handing Wang and Cong Tian and Yaochu Jin , year=. 2407.12443 , archivePrefix=

  38. [38]

    2408.03944 , archivePrefix=

    Jie Gui and Chengze Jiang and Minjing Dong and Kun Tong and Xinli Shi and Yuan Yan Tang and Dacheng Tao , year=. 2408.03944 , archivePrefix=

  39. [39]

    2025 , month =

    PLOS ONE , publisher =. 2025 , month =. doi:10.1371/journal.pone.0317023 , author =

  40. [40]
  41. [41]

    Robustness of classifiers: from adversarial to random noise

    Alhussein Fawzi and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard , year=. 1608.08967 , archivePrefix=

  42. [42]

    Classification regions of deep neural networks

    Alhussein Fawzi and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard and Stefano Soatto , year=. 1705.09552 , archivePrefix=

  43. [43]

    Robustness via curvature regularization, and vice versa

    Seyed-Mohsen Moosavi-Dezfooli and Alhussein Fawzi and Jonathan Uesato and Pascal Frossard , year=. 1811.09716 , archivePrefix=

  44. [44]

    2004.01832 , archivePrefix=

    Avery Ma and Fartash Faghri and Nicolas Papernot and Amir Massoud Farahmand , year=. 2004.01832 , archivePrefix=

  45. [45]

    2006.00731 , archivePrefix=

    Sahil Singla and Soheil Feizi , year=. 2006.00731 , archivePrefix=

  46. [46]

    2009.04923 , archivePrefix=

    Theodoros Tsiligkaridis and Jay Roberts , year=. 2009.04923 , archivePrefix=

  47. [47]

    2110.01858 , archivePrefix=

    Benyamin Ghojogh and Ali Ghodsi and Fakhri Karray and Mark Crowley , year=. 2110.01858 , archivePrefix=

  48. [48]

    2207.01396 , archivePrefix=

    Yaguan Qian and Yuqi Wang and Bin Wang and Zhaoquan Gu and Yuhan Guo and Wassim Swaileh , year=. 2207.01396 , archivePrefix=

  49. [49]

    Deep Residual Learning for Image Recognition

    Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun , year=. 1512.03385 , archivePrefix=

  50. [50]

    Identity Mappings in Deep Residual Networks

    Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun , year=. 1603.05027 , archivePrefix=

  51. [51]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis , year=. 1605.07146 , archivePrefix=

  52. [52]

    Squeeze-and-Excitation Networks

    Jie Hu and Li Shen and Samuel Albanie and Gang Sun and Enhua Wu , year=. 1709.01507 , archivePrefix=

  53. [53]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby , year=. 2010.11929 , archivePrefix=

  54. [54]

    Alex Krizhevsky , year=

  55. [55]

    ImageNet Large Scale Visual Recognition Challenge

    Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and Alexander C. Berg and Li Fei-Fei , year=. 1409.0575 , archivePrefix=

  56. [56]

    2009 , volume=

    Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Kai Li and Li Fei-Fei , booktitle=. 2009 , volume=

  57. [57]

    Ng , booktitle=

    Yuval Netzer and Tao Wang and Adam Coates and Alessandro Bissacco and Bo Wu and Andrew Y. Ng , booktitle=. 2011 , url=

  58. [58]

    2015 , url=

    Le, Ya and Yang, Xuan , journal=. 2015 , url=

  59. [59]

    2023 , publisher=

    Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing , journal=. 2023 , publisher=

  60. [60]

    2026 , url =

    The Emerging Science of Machine Learning Benchmarks , author =. 2026 , url =

  61. [61]

    2026 , publisher=

    High-Dimensional Probability: An Introduction with Applications in Data Science , author=. 2026 , publisher=