arxiv: 2410.13891 · v3 · submitted 2024-10-13 · 💻 cs.CR · cs.AI

S⁴ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack

Pith reviewed 2026-05-23 19:20 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: transferable targeted attack · adversarial attack · image transformation · black-box attack · scaling transformation · data-free attack · transferability · self-transferability

The pith

Attacking simple scaling transformations uniquely enhances targeted transferability under strict black-box conditions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that simple scaling stands apart from other image transformations in making targeted adversarial attacks transfer better to unseen models, even when no victim feedback or training data is available. It develops two blind measures, self-alignment and self-transferability, to rank transformations and spot correlations without violating black-box rules. These measures overturn the idea that only complex methods work well and instead point to scaling as the key driver of success. The resulting S4ST method combines consistent scaling, low-redundancy additions, and block-wise processing to reach strong performance with high speed. The authors tie scaling's power to the multi-scale character of natural images and the widespread use of scale augmentation during model training.

Core claim

Attacking simple scaling transformations uniquely enhances targeted transferability, outperforming other basic transformations and rivaling leading complex methods. Geometric and color transformations exhibit high internal redundancy despite weak inter-category correlations. The S4ST method integrates dimensionally consistent scaling, complementary low-redundancy transformations, and block-wise operations to deliver state-of-the-art effectiveness-efficiency balance in data-free settings. Scaling's effectiveness stems from visual data's multi-scale nature and ubiquitous scale augmentation during training, rendering such augmentation a double-edged sword. The framework generalizes to medical,

What carries the argument

Blind estimation measures of self-alignment and self-transferability that rank per-transformation effectiveness and cross-transformation correlations without any victim-model feedback or data
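
As a rough illustration of how a gradient-based alignment score could be computed on the surrogate alone, the sketch below compares the loss gradient on an input with the gradient obtained through a transformed copy, using cosine similarity (the quantity Figure 5 reports). This is an assumed reading, not the paper's definition: the toy quadratic loss, the finite-difference gradient, and the names `numerical_gradient`, `cosine_similarity`, and `self_alignment` are all hypothetical.

```python
import math

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at x (list of floats)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0 else 0.0

def self_alignment(loss, transform, x):
    """Hypothetical score: gradient agreement between loss(x) and loss(transform(x))."""
    g_plain = numerical_gradient(loss, x)
    g_transformed = numerical_gradient(lambda v: loss(transform(v)), x)
    return cosine_similarity(g_plain, g_transformed)

# Toy stand-in for a surrogate loss (assumption, not a real model).
WEIGHTS = [1.0, -2.0, 0.5, 3.0]

def toy_loss(v):
    return sum(w * p * p for w, p in zip(WEIGHTS, v))

x = [1.0, 2.0, 3.0, 4.0]
identity_score = self_alignment(toy_loss, lambda v: list(v), x)
flip_score = self_alignment(toy_loss, lambda v: v[::-1], x)
```

In this toy setting the identity transformation gives a score near 1, while a flip gives a noticeably lower one; a real study would use the surrogate network's backpropagated gradients instead of finite differences.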

If this is right

  • S4ST achieves state-of-the-art targeted transferability without relying on victim data or feedback
  • Scaling's advantage arises directly from the multi-scale structure of visual data and common training augmentations
  • Geometric and color transformations largely overlap with each other and add little new value when combined
  • The same S4ST design transfers effectively to medical imaging and face verification tasks

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other common training augmentations such as rotation or brightness shifts may also create exploitable transfer gaps if examined with the same blind measures
  • Model trainers could reduce vulnerability by deliberately varying scale augmentation strength or omitting it for certain layers
  • The redundancy pattern among transformations offers a general rule for pruning transformation sets in any future attack pipeline
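
The pruning idea in the last bullet can be sketched as a greedy filter over a correlation matrix. This is only an illustration: the function name `prune_redundant`, the 0.8 threshold, the transformation names, and every number in the matrix are invented placeholders, not values from the paper.

```python
def prune_redundant(names, corr, threshold=0.8):
    """Greedily keep transformations, in the given priority order, whose absolute
    correlation with every already-kept transformation stays below threshold."""
    kept = []
    for i in range(len(names)):
        if all(abs(corr[i][j]) < threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

# Placeholder data: two geometric and two color transformations that are
# internally redundant, plus scaling, which correlates weakly with the rest.
names = ["scale", "rotate", "shear", "brightness", "contrast"]
corr = [
    [1.0, 0.3, 0.3, 0.1, 0.1],
    [0.3, 1.0, 0.9, 0.1, 0.1],
    [0.3, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.85],
    [0.1, 0.1, 0.1, 0.85, 1.0],
]
selected = prune_redundant(names, corr)
```

With these placeholder values the filter keeps one representative per redundant group. The result depends on the input order, so the list would normally be sorted by some per-transformation effectiveness score first.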

Load-bearing premise

The blind estimation measures of self-alignment and self-transferability accurately reflect per-transformation effectiveness and cross-transformation correlations under strict black-box constraints without any victim-model feedback or data
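
To make the premise concrete, here is a minimal sketch of one way a self-transferability score could be computed without any victim model: the fraction of adversarial examples that the surrogate itself still assigns to the target class after a transformation. This is an assumed reading of the measure; the function name, the toy classifier, and the sample inputs are hypothetical.

```python
def self_transferability(predict, adversarial_examples, target_label, transform):
    """Fraction of adversarial examples still classified as target_label by the
    surrogate's own predict function after transform is applied."""
    if not adversarial_examples:
        return 0.0
    hits = sum(
        1 for x in adversarial_examples if predict(transform(x)) == target_label
    )
    return hits / len(adversarial_examples)

# Toy surrogate (assumption): class 1 if the mean intensity exceeds 0.5.
def toy_predict(x):
    return 1 if sum(x) / len(x) > 0.5 else 0

examples = [[0.9, 0.8], [0.6, 0.55], [0.52, 0.51]]

unchanged = self_transferability(toy_predict, examples, 1, lambda x: list(x))
darkened = self_transferability(toy_predict, examples, 1, lambda x: [0.9 * p for p in x])
```

Here all three toy examples hit the target when left unchanged, and one of them fails once the input is darkened, so the score drops. Whether such a surrogate-only score tracks real black-box transfer is exactly the premise stated above.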

What would settle it

A controlled experiment that applies the S4ST attack to a new set of victim models never seen during measure development and checks whether scaling-based attacks actually transfer at higher rates than attacks using other basic transformations

Figures

Figures reproduced from arXiv: 2410.13891 by Bowen Peng, Li Liu, Xiang Li, Yongxiang Liu.

Figure 1: Comparison against existing transformation methods at incremental …
Figure 2: Illustration of the original image and its transformed versions.
Figure 3: Scatter diagrams depicting relationships between …
Figure 4: Alignment between the surrogate model and black-box models
Figure 5: The colored numbers indicate the cosine similarity between …
Figure 9: PCCs calculated between the black-box transferability against 14 black-box models and the self-transferability against various basic transformations, based on AEs obtained from 12 existing attacks.
Figure 8: Scatter diagrams depicting relationships between …
Figure 11: PCCs between the self-transferability to scaling and other basic …
Figure 12: Results by applying various basic transformations with incremental …
Figure 13: Our analysis uncovers significant untapped potential in existing …
Figure 14: In addition to its effectiveness, our S4ST also exhibits superior efficiency. The additional time required by our S4ST is trivial, significantly outstripping the simple scaling and the integration of RDI [28] and DI [10] methods in terms of computational speed.
Figure 15: Further enhancements to S4ST-Base achieved by (a) S4ST-Block and (b) S4ST-Aug, both of which demonstrate significant improvements.
Figure 16: Real-world APIs-returned label list for targeted AEs generated by S …
Figure 17: Responses obtained from vision-language models for targeted AEs generated by S …
Figure 18: Targeted AEs, bolstered by S4ST, unveil the potential mechanism for enhancing targeted transferability through image transformations: by strengthening the manipulation of objects and textures within individual images to generate semantics that align with the intended target label.
Original abstract

Transferable Targeted Attacks (TTAs) face significant challenges due to severe overfitting to surrogate models. Recent breakthroughs heavily rely on large-scale training data of victim models, while data-free solutions, i.e., image transformation-involved gradient optimization, often depend on black-box feedback for method design and tuning. These dependencies violate black-box transfer settings and compromise threat evaluation fairness. In this paper, we propose two blind estimation measures, self-alignment and self-transferability, to analyze per-transformation effectiveness and cross-transformation correlations under strict black-box constraints. Our findings challenge conventional assumptions: (1) Attacking simple scaling transformations uniquely enhances targeted transferability, outperforming other basic transformations and rivaling leading complex methods; (2) Geometric and color transformations exhibit high internal redundancy despite weak inter-category correlations. These insights drive the design and tuning of S⁴ST (Strong, Self-transferable, faSt, Simple Scale Transformation), which integrates dimensionally consistent scaling, complementary low-redundancy transformations, and block-wise operations. Extensive evaluations across diverse architectures, training distributions, and tasks show that S⁴ST achieves state-of-the-art effectiveness-efficiency balance without data dependency. We reveal that scaling's effectiveness stems from visual data's multi-scale nature and ubiquitous scale augmentation during training, rendering such augmentation a double-edged sword. Further validations on medical imaging and face verification confirm the framework's strong generalization.
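
For readers unfamiliar with scale transformations in attack pipelines, the following sketch shows one simple way to rescale image content at random while keeping the output the same size as the input. It is not the paper's S⁴ST implementation: the centre-anchored nearest-neighbour resampling, the zero fill, the sampling range [1/r, r], and the names `scale_about_center` and `random_scale` are assumptions made for illustration.

```python
import math
import random

def scale_about_center(image, s, fill=0.0):
    """Zoom a 2D image (list of rows) by factor s about its centre using
    nearest-neighbour sampling; the output keeps the input's height and width."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Map each output pixel back to its source coordinate.
            si = int(math.floor(cy + (i - cy) / s + 0.5))
            sj = int(math.floor(cx + (j - cx) / s + 0.5))
            inside = 0 <= si < h and 0 <= sj < w
            row.append(image[si][sj] if inside else fill)
        out.append(row)
    return out

def random_scale(image, r, rng=random):
    """Sample a scale factor uniformly from [1/r, r] (r >= 1) and apply it."""
    s = rng.uniform(1.0 / r, r)
    return scale_about_center(image, s)

img = [[float(4 * i + j) for j in range(4)] for i in range(4)]
shrunk = scale_about_center(img, 0.5)   # content shrinks, borders are filled
same = random_scale(img, 1.0)           # r = 1 leaves the image unchanged
```

In a gradient-based attack, a transformation like this would typically be applied to the current adversarial image before each forward pass on the surrogate, so that the perturbation is optimized to survive rescaling.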

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces two blind estimation measures—self-alignment and self-transferability—computed solely from surrogate gradients and image statistics under strict no-feedback black-box constraints. These measures are used to analyze per-transformation effectiveness and cross-transformation correlations, leading to the claim that simple scaling transformations uniquely enhance targeted transferability (outperforming other basic transformations and rivaling complex methods). This insight drives the design of S⁴ST, which combines dimensionally consistent scaling, low-redundancy transformations, and block-wise operations. The method is reported to achieve SOTA effectiveness-efficiency balance across architectures, distributions, and tasks (including medical imaging and face verification) without data dependency or victim-model feedback. The paper attributes scaling's effectiveness to the multi-scale nature of visual data and ubiquitous scale augmentation in training.

Significance. If the blind measures prove predictive of actual black-box targeted transfer rates, the work would provide a data-free, feedback-free framework for designing and tuning TTAs, challenging the reliance on complex methods or large victim-model datasets. The paper deserves explicit credit for reproducible empirical evaluations across diverse settings and for identifying a simple transformation (scaling) that rivals leading approaches. The generalization experiments on non-standard domains strengthen the case for broader applicability.

major comments (2)
  1. [Abstract / measure definitions] Abstract and the sections introducing the measures (around the proposed self-alignment/self-transferability definitions): the central claim that scaling uniquely boosts targeted transferability rests on these two measures correctly ranking transformations and revealing correlations. However, no experiment directly correlates the blind measure rankings with measured black-box targeted success rates on held-out victim models; the measures are used both to discover the scaling insight and to tune S⁴ST, creating a potential circularity if the measures are only weakly correlated with true transferability.
  2. [Evaluation sections (S⁴ST results)] The experimental sections reporting S⁴ST performance: because S⁴ST hyperparameters and transformation choices are selected via the unvalidated blind measures, the reported SOTA results on victim models may not demonstrate that the scaling insight generalizes beyond the surrogate used to compute the measures.
minor comments (2)
  1. [Title] The title contains inconsistent capitalization (faSt); consider standardizing to 'S⁴ST: A Strong, Self-transferable, Fast, and Simple Scale Transformation for Transferable Targeted Attack'.
  2. [Method section] Notation for the two new measures should be introduced with explicit formulas or pseudocode in the main text rather than relying solely on prose descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of the blind measures and the generalization experiments. We address the two major comments point-by-point below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract / measure definitions] Abstract and the sections introducing the measures (around the proposed self-alignment/self-transferability definitions): the central claim that scaling uniquely boosts targeted transferability rests on these two measures correctly ranking transformations and revealing correlations. However, no experiment directly correlates the blind measure rankings with measured black-box targeted success rates on held-out victim models; the measures are used both to discover the scaling insight and to tune S⁴ST, creating a potential circularity if the measures are only weakly correlated with true transferability.

    Authors: We agree that a direct correlation study between the blind measure rankings and actual black-box targeted transfer rates on held-out victim models is absent from the current manuscript and would strengthen the claims. The measures are strictly computed from surrogate gradients and image statistics under no-feedback constraints, and the final S⁴ST results on diverse victim models provide supporting evidence; however, this does not replace an explicit validation of the measures' predictive power. In the revision we will add a dedicated experiment that ranks transformations by the two measures on the surrogate and directly compares those rankings against measured targeted success rates on multiple held-out victim models. This addition will address the circularity concern. revision: yes

  2. Referee: [Evaluation sections (S⁴ST results)] The experimental sections reporting S⁴ST performance: because S⁴ST hyperparameters and transformation choices are selected via the unvalidated blind measures, the reported SOTA results on victim models may not demonstrate that the scaling insight generalizes beyond the surrogate used to compute the measures.

    Authors: We acknowledge the concern that hyperparameter and transformation selection via the surrogate-derived measures could limit claims of generalization. While S⁴ST is evaluated on held-out victim models across architectures, distributions, and tasks (including medical imaging and face verification), the selection process itself was not cross-validated against victim performance. In the revised manuscript we will expand the evaluation sections with (i) an explicit correlation analysis between measure rankings and victim transfer rates and (ii) additional results obtained by re-selecting transformations using an alternative surrogate, thereby demonstrating that the scaling insight is not surrogate-specific. revision: yes
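
The correlation analysis promised in both responses could take the form of a rank correlation between per-transformation blind scores and measured targeted success rates on held-out victims. The sketch below uses Spearman's coefficient, which is one reasonable choice and not necessarily what the authors will use; the helper names and all numbers are placeholders, not results from the paper.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Placeholder values, one entry per transformation (not real measurements).
blind_scores = [0.9, 0.4, 0.5, 0.2]
victim_success = [0.7, 0.2, 0.3, 0.1]
rho = spearman(blind_scores, victim_success)
```

A coefficient near 1 on victims never used during measure development would support the measures' predictive power; a value near 0 would confirm the referee's circularity concern.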

Circularity Check

0 steps flagged

No significant circularity; measures presented as independent blind estimators

full rationale

The paper introduces self-alignment and self-transferability as new blind estimation measures computed from surrogate gradients and image statistics under strict black-box constraints. These are used to analyze transformations and derive the scaling insight that then informs S⁴ST design. No equations, definitions, or self-citations are shown that reduce the measures to fitted parameters, self-referential predictions, or load-bearing prior results by the same authors. The derivation chain remains self-contained against external benchmarks, with the measures functioning as independent analysis tools rather than quantities defined in terms of the target transferability outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the validity of two newly introduced blind measures and the domain assumption that visual data's multi-scale nature plus scale augmentation during training explains scaling's attack effectiveness; no free parameters or invented physical entities are visible.

axioms (1)
  • domain assumption: Visual data possesses a multi-scale nature, and scale augmentation is ubiquitous during model training
    Invoked to explain why scaling transformations are effective for transferable attacks.
invented entities (2)
  • self-alignment measure (no independent evidence)
    purpose: Blind estimation of per-transformation effectiveness under black-box constraints
    Newly proposed in the paper; no independent evidence supplied in abstract.
  • self-transferability measure (no independent evidence)
    purpose: Blind estimation of cross-transformation correlations under black-box constraints
    Newly proposed in the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5797 in / 1344 out tokens · 27896 ms · 2026-05-23T19:20:40.103995+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition

    cs.CV 2026-04 unverdicted novelty 6.0

    Light-ResKAN reaches 99.09% accuracy on MSTAR SAR images with 82.9 times fewer FLOPs and 163.78 times fewer parameters than VGG16 by combining KAN convolutions, Gram polynomials, and channel-wise parameter sharing.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Intriguing properties of neural networks,

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014. 1

  2. [2]

    Towards deep learning models resistant to adversarial attacks,

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018. 1, 3, 11, 14

  3. [3]

    Ensemble adversarial training: Attacks and defenses,

    F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” in International Conference on Learning Representations, 2018. 1, 11

  4. [4]

    Revisiting auc-oriented adversarial training with loss-agnostic perturbations,

    Z. Yang, Q. Xu, W. Hou, S. Bao, Y. He, X. Cao, and Q. Huang, “Revisiting auc-oriented adversarial training with loss-agnostic perturbations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 1

  5. [5]

    Improving fast adversarial training with prior-guided knowledge,

    X. Jia, Y. Zhang, X. Wei, B. Wu, K. Ma, J. Wang, and X. Cao, “Improving fast adversarial training with prior-guided knowledge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1

  6. [6]

    Explaining and harnessing adversarial examples,

    I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015. 1

  7. [7]

    A survey on transferability of adversarial examples across deep neural networks,

    J. Gu, X. Jia, P. de Jorge, W. Yu, X. Liu, A. Ma, Y. Xun, A. Hu, A. Khakzar, Z. Li et al., “A survey on transferability of adversarial examples across deep neural networks,” TMLR, 2024. 1

  8. [8]

    Boosting adversarial attacks with momentum,

    Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018. 1, 3, 10, 11

  9. [9]

    Evading defenses to transferable adversarial examples by translation-invariant attacks,

    Y. Dong, T. Pang, H. Su, and J. Zhu, “Evading defenses to transferable adversarial examples by translation-invariant attacks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4312–4321. 1, 3, 4, 10, 11

  10. [10]

    Improving transferability of adversarial examples with input diversity,

    C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren, and A. L. Yuille, “Improving transferability of adversarial examples with input diversity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2730–2739. 1, 2, 3, 4, 8, 9, 10

  11. [11]

    https://github.com/pytorch/vision/tree/main/torchvision/transforms SUBMITTED TO IEEE TPAMI 15

  12. [12]

    Structure invariant transformation for better adversarial transferability,

    X. Wang, Z. Zhang, and J. Zhang, “Structure invariant transformation for better adversarial transferability,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2023, pp. 4607–

  13. [13]

    1, 4, 7, 8, 9, 10, 11

  14. [14]

    Boosting adversarial transferability by block shuffle and rotation,

    K. Wang, X. He, W. Wang, and X. Wang, “Boosting adversarial transferability by block shuffle and rotation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

  15. [15]

    Boosting adversarial transferability across model genus by deformation-constrained warping,

    Q. Lin, C. Luo, Z. Niu, X. He, W. Xie, Y. Hou, L. Shen, and S. Song, “Boosting adversarial transferability across model genus by deformation-constrained warping,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 4, pp. 3459–3467, Mar

  16. [16]

    Understanding adversarial examples from the mutual influence of images and perturbations,

    C. Zhang, P. Benz, T. Imtiaz, and I. S. Kweon, “Understanding adversarial examples from the mutual influence of images and perturbations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020. 1, 3, 9, 11, 13

  17. [17]

    On generating transferable targeted perturbations,

    M. Naseer, S. Khan, M. Hayat, F. S. Khan, and F. Porikli, “On generating transferable targeted perturbations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2021, pp. 7708–

  18. [18]

    Exploring non-target knowledge for improving ensemble universal adversarial attacks,

    J. Weng, Z. Luo, Z. Zhong, D. Lin, and S. Li, “Exploring non-target knowledge for improving ensemble universal adversarial attacks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 2768–2775. 1, 3

  19. [19]

    Towards transferable targeted adversarial examples,

    Z. Wang, H. Yang, Y. Feng, P. Sun, H. Guo, Z. Zhang, and K. Ren, “Towards transferable targeted adversarial examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2023, pp. 20534–20543. 1

  20. [20]

    Minimizing maximum model discrepancy for transferable black-box targeted attacks,

    A. Zhao, T. Chu, Y. Liu, W. Li, J. Li, and L. Duan, “Minimizing maximum model discrepancy for transferable black-box targeted attacks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2023, pp. 8153–8162. 1, 4, 9, 10, 11, 12

  21. [21]

    Rethinking adversarial transferability from a data distribution perspective,

    Y. Zhu, J. Sun, and Z. Li, “Rethinking adversarial transferability from a data distribution perspective,” in International Conference on Learning Representations, 2022. 1, 3, 9, 11

  22. [22]

    Toward understanding and boosting adversarial transferability from a distribution perspective,

    Y. Zhu, Y. Chen, X. Li, K. Chen, Y. He, X. Tian, B. Zheng, Y. Chen, and Q. Huang, “Toward understanding and boosting adversarial transferability from a distribution perspective,” IEEE Transactions on Image Processing, vol. 31, pp. 6487–6501, 2022. 1, 3, 9, 11

  23. [23]

    Improving transferable targeted adversarial attacks with model self-enhancement,

    H. Wu, G. Ou, W. Wu, and Z. Zheng, “Improving transferable targeted adversarial attacks with model self-enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 24615–24624. 1, 9, 11, 12

  24. [24]

    Delving into transferable adversarial examples and black-box attacks,

    Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” in International Conference on Learning Representations, 2017. 1

  25. [25]

    Towards transferable targeted attack,

    M. Li, C. Deng, T. Li, J. Yan, X. Gao, and H. Huang, “Towards transferable targeted attack,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020. 2, 3

  26. [26]

    On success and simplicity: A second look at transferable targeted attacks,

    Z. Zhao, Z. Liu, and M. Larson, “On success and simplicity: A second look at transferable targeted attacks,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 6115–6128. 2, 3, 4, 8, 13

  27. [27]

    Logit margin matters: Improving transferable targeted adversarial attack by logit calibration,

    J. Weng, Z. Luo, S. Li, N. Sebe, and Z. Zhong, “Logit margin matters: Improving transferable targeted adversarial attack by logit calibration,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 3561–3574, 2023. 2, 3, 4

  28. [28]

    On single-model transferable targeted attacks: A closer look at decision-level optimization,

    X. Sun, G. Cheng, H. Li, L. Pei, and J. Han, “On single-model transferable targeted attacks: A closer look at decision-level optimization,” IEEE Transactions on Image Processing, vol. 32, pp. 2972–2984, 2023. 2, 3, 4

  29. [29]

    Rethinking data augmentation for improving transferable targeted attacks,

    Z. Wei, J. Chen, Z. Wu, and Y.-G. Jiang, “Rethinking data augmentation for improving transferable targeted attacks,” 2023. [Online]. Available: https://openreview.net/forum?id=go0P5gsBE2 2, 3, 4, 9, 10, 11

  30. [30]

    Improving the transferability of adversarial examples with resized-diverse-inputs, diversity-ensemble and region fitting,

    J. Zou, Z. Pan, J. Qiu, X. Liu, T. Rui, and W. Li, “Improving the transferability of adversarial examples with resized-diverse-inputs, diversity-ensemble and region fitting,” in European Conference on Computer Vision. Springer International Publishing, 2020, pp. 563–

  31. [31]

    Introducing competition to boost the transferability of targeted adversarial examples through clean feature mixup,

    J. Byun, M.-J. Kwon, S. Cho, Y. Kim, and C. Kim, “Introducing competition to boost the transferability of targeted adversarial examples through clean feature mixup,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2023, pp. 24648–24657. 2, 3, 9, 11, 12

  32. [32]

    Adversarial examples in the physical world,

    A. Kurakin, I. Goodfellow, S. Bengio et al., “Adversarial examples in the physical world,” in International Conference on Learning Representations, 2017. 3

  33. [33]

    Randaugment: Practical automated data augmentation with a reduced search space,

    E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 702–703. 3

  34. [34]

    Improving the transferability of targeted adversarial examples through object-based diverse input,

    J. Byun, S. Cho, M.-J. Kwon, H.-S. Kim, and C. Kim, “Improving the transferability of targeted adversarial examples through object-based diverse input,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2022, pp. 15244–15253. 3, 4, 8, 9, 10, 11, 14

  35. [35]

    Enhancing the self-universality for transferable targeted attacks,

    Z. Wei, J. Chen, Z. Wu, and Y.-G. Jiang, “Enhancing the self-universality for transferable targeted attacks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2023, pp. 12281–12290. 3, 4, 8, 9, 11

  36. [36]

    Perturbing across the feature hierarchy to improve standard and strict blackbox attack transferability,

    N. Inkawhich, K. Liang, B. Wang, M. Inkawhich, L. Carin, and Y. Chen, “Perturbing across the feature hierarchy to improve standard and strict blackbox attack transferability,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 20791–20801. 3

  37. [37]

    A little robustness goes a long way: Leveraging robust features for targeted transfer attacks,

    J. Springer, M. Mitchell, and G. Kenyon, “A little robustness goes a long way: Leveraging robust features for targeted transfer attacks,” Advances in Neural Information Processing Systems, vol. 34, pp. 9759–9773, 2021. 3, 9, 11

  38. [38]

    Sharpness-aware minimization for efficiently improving generalization,

    P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” in International Conference on Learning Representations, 2021. 3

  39. [39]

    Boosting transferability of targeted adversarial examples via hierarchical generative networks,

    X. Yang, Y. Dong, T. Pang, H. Su, and J. Zhu, “Boosting transferability of targeted adversarial examples via hierarchical generative networks,” in European Conference on Computer Vision. Springer, 2022, pp. 725–742. 4, 9, 11

  40. [40]

    A survey on image data augmentation for deep learning,

    C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019. 4

  41. [41]

    Data augmentation for improving deep learning in image classification problem,

    A. Mikołajczyk and M. Grochowski, “Data augmentation for improving deep learning in image classification problem,” in 2018 international interdisciplinary PhD workshop (IIPhDW). IEEE, 2018, pp. 117–122. 4

  42. [42]

    Improving deep learning with generic data augmentation,

    L. Taylor and G. Nitschke, “Improving deep learning with generic data augmentation,” in 2018 IEEE symposium series on computational intelligence (SSCI). IEEE, 2018, pp. 1542–1547. 4

  43. [43]

    Automa: Towards automatic model augmentation for transferable adversarial attacks,

    H. Yuan, Q. Chu, F. Zhu, R. Zhao, B. Liu, and N. Yu, “Automa: Towards automatic model augmentation for transferable adversarial attacks,” IEEE Transactions on Multimedia, vol. 25, pp. 203–213, 2021. 4

  44. [44]

    Adaptive image transformations for transfer-based adversarial attack,

    Z. Yuan, J. Zhang, and S. Shan, “Adaptive image transformations for transfer-based adversarial attack,” in European Conference on Computer Vision. Springer, 2022, pp. 1–17. 4

  45. [45]

    Learning to transform dynamically for better adversarial transferability,

    R. Zhu, Z. Zhang, S. Liang, Z. Liu, and C. Xu, “Learning to transform dynamically for better adversarial transferability,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 24273–24283. 4

  46. [46]

    Nesterov accelerated gradient and scale invariance for adversarial attacks,

    J. Lin, C. Song, K. He, L. Wang, and J. E. Hopcroft, “Nesterov accelerated gradient and scale invariance for adversarial attacks,” in International Conference on Learning Representations, 2020. 4, 9, 10

  47. [47]

    Admix: Enhancing the transferability of adversarial attacks,

    X. Wang, X. He, J. Wang, and K. He, “Admix: Enhancing the transferability of adversarial attacks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2021, pp. 16 158–16 167. 4, 9, 10

  48. [48]

    Frequency domain model augmentation for adversarial attack,

    Y. Long, Q. Zhang, B. Zeng, L. Gao, X. Liu, J. Zhang, and J. Song, “Frequency domain model augmentation for adversarial attack,” in European Conference on Computer Vision. Springer International Publishing, 2022, pp. 549–566. 4, 7, 9, 10

  49. [49]

    Universal adversarial attack on attention and the resulting dataset damagenet,

    S. Chen, Z. He, C. Sun, J. Yang, and X. Huang, “Universal adversarial attack on attention and the resulting dataset damagenet,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 2188–2197, 2020. 4

  50. [50]

    Feature importance-aware transferable adversarial attacks,

    Z. Wang, H. Guo, Z. Zhang, W. Liu, Z. Qin, and K. Ren, “Feature importance-aware transferable adversarial attacks,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 7639–7648. 4

  51. [51]

    Grad-cam: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626. 4

  52. [52]

    Analysis of representations for domain adaptation,

    S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” NeurIPS, 2006. 5

  53. [53]

    Position: The platonic representation hypothesis,

    M. Huh, B. Cheung, T. Wang, and P. Isola, “Position: The platonic representation hypothesis,” in Forty-first International Conference on Machine Learning. 6

  54. [54]

    Learning transferable adversarial examples via ghost networks,

    Y. Li, S. Bai, Y. Zhou, C. Xie, Z. Zhang, and A. Yuille, “Learning transferable adversarial examples via ghost networks,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11 458–11 465, Apr. 2020. 6

  55. [55]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016. 8

  56. [56]

    Mobilenetv2: Inverted residuals and linear bottlenecks,

    M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018. 9

  57. [57]

    EfficientNet: Rethinking model scaling for convolutional neural networks,

    M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the International Conference on Machine Learning, vol. 97. PMLR, 09–15 Jun 2019, pp. 6105–6114. 9

  58. [58]

    A convnet for the 2020s,

    Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2022, pp. 11 976–11 986. 9

  59. [59]

    Rethinking the inception architecture for computer vision,

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June

  60. [60]

    Inception-v4, inception-resnet and the impact of residual connections on learning,

    C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, Feb. 2017. 9

  61. [61]

    Xception: Deep learning with depthwise separable convolutions,

    F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 2017. 9

  62. [62]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021. 9, 10

  63. [63]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2021, pp. 10 012–10 022. 9

  64. [64]

    Maxvit: Multi-axis vision transformer,

    Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “Maxvit: Multi-axis vision transformer,” in European Conference on Computer Vision. Springer International Publishing, 2022, pp. 459–

  65. [65]

    Twins: Revisiting the design of spatial attention in vision transformers,

    X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, “Twins: Revisiting the design of spatial attention in vision transformers,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 9355–9366. 9

  66. [66]

    Rethinking spatial dimensions of vision transformers,

    B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, “Rethinking spatial dimensions of vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2021, pp. 11 936–11 945. 9

  67. [67]

    Transformer in transformer,

    K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer in transformer,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 15 908–15 919. 9

  68. [68]

    Training data-efficient image transformers & distillation through attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training data-efficient image transformers & distillation through attention,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, July 2021, pp. 10 347–10 357. 9

  69. [69]

    Pytorch image models,

    R. Wightman, “Pytorch image models,” https://github.com/rwightman/pytorch-image-models, 2019. 9

  70. [70]

    Pytorch: An imperative style, high-performance deep learning library,

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing ...

  71. [71]

    Densely connected convolutional networks,

    G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July

  72. [72]

    Very deep convolutional networks for large-scale image recognition,

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015. 10

  73. [73]

    Augmix: A simple method to improve robustness and uncertainty under data shift,

    D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “Augmix: A simple method to improve robustness and uncertainty under data shift,” in International Conference on Learning Representations, 2020. 10, 11

  74. [74]

    Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

    R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness,” in International Conference on Learning Representations, 2019. 11

  75. [75]

    Do adversarially robust imagenet models transfer better?

    H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry, “Do adversarially robust imagenet models transfer better?” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 3533–3545. 11

  76. [76]

    CogVLM2: Visual Language Models for Image and Video Understanding

    W. Hong, W. Wang, M. Ding, W. Yu, Q. Lv, Y. Wang, Y. Cheng, S. Huang, J. Ji, Z. Xue et al., “Cogvlm2: Visual language models for image and video understanding,” arXiv preprint arXiv:2408.16500, 2024. 12

  77. [77]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024. 12

  78. [78]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-pro: Unified multimodal understanding and generation with data and model scaling,” arXiv preprint arXiv:2501.17811, 2025. 12, 13

  79. [79]

    Optimism in the face of adversity: Understanding and improving deep learning through adversarial robustness,

    G. Ortiz-Jiménez, A. Modas, S.-M. Moosavi-Dezfooli, and P. Frossard, “Optimism in the face of adversity: Understanding and improving deep learning through adversarial robustness,” Proceedings of the IEEE, vol. 109, no. 5, pp. 635–659, 2021. 14

  80. [80]

    Adversarial examples make strong poisons,

    L. Fowl, M. Goldblum, P.-y. Chiang, J. Geiping, W. Czaja, and T. Goldstein, “Adversarial examples make strong poisons,” Advances in Neural Information Processing Systems, vol. 34, pp. 30 339–30 351,

Showing first 80 references.