
arxiv: 2604.26772 · v1 · submitted 2026-04-29 · 💻 cs.CV

TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

Pith reviewed 2026-05-07 13:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detection · Vision foundation models · Tunable attention pooling · Image forensics · Patch tokens · AIGI detection · Deepfake detection

The pith

Tunable attention pooling on modern vision foundation models lifts AI-generated image detection to a new state of the art.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks multiple families of vision foundation models for detecting AI-generated images and finds that newer models deliver stronger out-of-the-box features than the original CLIP vision transformer. To exploit the patch tokens these models produce, the authors replace the standard classifier head with tunable attention pooling, a lightweight mechanism that aggregates the tokens into a more refined global representation. When combined with the strongest VFMs, the change produces large accuracy gains on several benchmarks and reaches new state-of-the-art performance on two challenging in-the-wild tests that include both fully generated and inpainted images. A reader would care because the approach offers a practical, low-cost upgrade to existing forensic tools that must keep pace with rapidly advancing generative models.

Core claim

Out-of-the-box features from recent vision foundation models outperform those from the original CLIP-ViT for AI-generated image detection. A simple tunable attention pooling head that aggregates the model's output tokens into a refined global representation yields further substantial gains and establishes a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and inpainted images.

What carries the argument

Tunable attention pooling (TAP), a redesign of the classifier head that aggregates output tokens from the vision foundation model into a refined global representation.
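To make the idea of aggregating tokens into one global representation concrete, here is a minimal sketch of attention pooling with a single learnable query. It is an illustration under stated assumptions, not the authors' formulation: the paper's exact TAP design (number of heads, projections, initialization) is not given on this page, and the names `attention_pool`, `mean_pool`, `tokens`, and `query` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, query):
    """Pool N token vectors (each of length D) into one D-dim vector.

    `query` stands in for the tunable part: a D-dim vector that would be
    trained on the real-vs-generated task. That the backbone stays frozen
    is an assumption of this sketch.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    # One scaled dot-product score per token.
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    # The weighted sum of tokens is the pooled global representation.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def mean_pool(tokens):
    """Baseline: uniform average of all tokens."""
    n = len(tokens)
    return [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]


rng = random.Random(0)
# Hypothetical stand-in for backbone output: 16 patch tokens of width 8.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
query = [rng.gauss(0.0, 1.0) for _ in range(8)]

pooled, weights = attention_pool(tokens, query)
# A zero query gives every token the same score, so pooling reduces to the mean.
uniform, _ = attention_pool(tokens, [0.0] * 8)
```

With an all-zero query the head reduces to plain mean pooling, so a trained query can only depart from that baseline by weighting some patches more than others. A linear classifier on `pooled` would then produce the real-versus-generated score.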

If this is right

  • The best out-of-the-box VFM outperforms the original CLIP by more than 12% in accuracy.
  • That same out-of-the-box model also beats established detection approaches.
  • Adding TAP to the latest VFMs sets new state-of-the-art results on two in-the-wild benchmarks covering both fully generated and inpainted AI images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same TAP head could be attached to other downstream tasks that rely on patch-token representations from VFMs.
  • Newer VFMs trained with different objectives may need even less adaptation when paired with attention-based pooling for forensic use.
  • Direct comparison on images from entirely post-VFM generative architectures would test the claimed generalization.

Load-bearing premise

That features from the tested vision foundation models will remain discriminative for images produced by future unseen generative models and that the chosen benchmarks adequately capture real-world distribution shifts.

What would settle it

A new benchmark built from a generative model released after the VFMs were trained, on which the TAP-augmented detector shows no meaningful accuracy gain over the original CLIP baseline.

Figures

Figures reproduced from arXiv: 2604.26772 by Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller.

Figure 1: Performance of our approach in terms of individual and …
Figure 2: t-SNE analysis of the feature spaces of three pretrained foundational ViT variants: CLIP-ViT-L/14 [ …
Figure 3: Method Overview. Classical semantic feature extraction …
Original abstract

Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks multiple families of vision foundation models (VFMs) as out-of-the-box feature extractors for detecting fully AI-generated images and AI-inpainted images. It reports that the strongest VFM exceeds the original CLIP-ViT by more than 12% accuracy and, after replacing the classifier head with tunable attention pooling (TAP) over patch tokens, establishes new state-of-the-art results on two in-the-wild AIGI/inpainting benchmarks.

Significance. If the empirical numbers hold, the work shows that architectural and training advances in recent VFMs can be directly exploited for AIGI detection with only a lightweight, tunable pooling head. The comprehensive cross-VFM evaluation and the simple TAP redesign could become a useful baseline for future forensics research.

major comments (2)
  1. [Abstract and experimental protocol] The central generalization claim (features from current VFMs plus TAP remain discriminative for future unseen generators) is load-bearing for the 'in-the-wild' SOTA assertion, yet the described protocol only evaluates on existing benchmarks without an explicit hold-out generator family, architecture-shift, or post-processing-shift test. This matches the known rapid degradation pattern in the AIGI literature.
  2. [Results and experimental setup] Soundness of the reported accuracy gains cannot be assessed without the full experimental details: data splits, number of runs, statistical tests, and whether any post-hoc model selection occurred across the VFM suite.
minor comments (2)
  1. [Section 3] Clarify the exact list of VFMs, their input resolutions, and pretraining objectives in a single table for reproducibility.
  2. [Method] Specify the precise formulation of TAP (number of attention heads, learnable parameters, initialization) and whether it is trained from scratch or fine-tuned.
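The hold-out test requested in major comment 1 can be sketched as a leave-one-generator-family-out split: every fake image from one family is kept out of training and used only for testing. This is a hypothetical protocol for illustration, not the paper's setup. The family labels, the `"real"` tag, and the even/odd split of real images are all assumptions.

```python
def leave_one_family_out(samples):
    """Yield (held_out_family, train, test) for each generator family.

    `samples` is a list of (image_id, family) pairs, where family is the
    assumed tag "real" or the name of a generator family.
    """
    reals = [s for s in samples if s[1] == "real"]
    fakes = [s for s in samples if s[1] != "real"]
    # Split real images once into two disjoint halves so none leak into the test set.
    real_train, real_test = reals[0::2], reals[1::2]
    for held_out in sorted({family for _, family in fakes}):
        train = real_train + [s for s in fakes if s[1] != held_out]
        test = real_test + [s for s in fakes if s[1] == held_out]
        yield held_out, train, test


# Hypothetical toy dataset; the family names are placeholders.
samples = [
    ("r0", "real"), ("r1", "real"), ("r2", "real"), ("r3", "real"),
    ("g0", "gan"), ("g1", "gan"),
    ("d0", "latent_diffusion"), ("d1", "latent_diffusion"),
    ("f0", "flow_transformer"),
]
splits = list(leave_one_family_out(samples))
```

Reporting accuracy per held-out family, rather than one pooled number, would show whether the detector degrades on a specific class of generators.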

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our generalization claims and the need for fuller experimental details. We address each major comment below and will revise the manuscript to incorporate clarifications and additional experiments.

  1. Referee: [Abstract and experimental protocol] The central generalization claim (features from current VFMs plus TAP remain discriminative for future unseen generators) is load-bearing for the 'in-the-wild' SOTA assertion, yet the described protocol only evaluates on existing benchmarks without an explicit hold-out generator family, architecture-shift, or post-processing-shift test. This matches the known rapid degradation pattern in the AIGI literature.

    Authors: We agree that an explicit hold-out test for future generators would strengthen the generalization claim. While our benchmarks already incorporate images from diverse, unseen generative models (as the VFMs were pretrained primarily on real data), we will add a new experiment in the revised manuscript using a recent hold-out generator family (e.g., a post-2023 diffusion model excluded from the original benchmarks) along with architecture-shift and post-processing tests to directly validate the claim. revision: yes

  2. Referee: [Results and experimental setup] Soundness of the reported accuracy gains cannot be assessed without the full experimental details: data splits, number of runs, statistical tests, and whether any post-hoc model selection occurred across the VFM suite.

    Authors: We concur that complete experimental details are necessary for assessing the results. In the revised manuscript, we will expand the experimental section and add an appendix with: precise data splits for each benchmark, the number of independent runs with mean/std, applied statistical tests (e.g., significance testing for accuracy differences), and explicit confirmation that all VFMs were evaluated uniformly without post-hoc selection. revision: yes
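The reporting promised in the second response (mean and standard deviation over independent runs, plus a significance test) could look like the sketch below. The accuracy values are placeholders invented for illustration, not results from the paper, and the exact sign-flip permutation test is one possible choice of test, not one the authors name.

```python
import itertools
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis each paired difference is equally likely to
    be positive or negative, so all 2**n sign assignments are enumerated.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Placeholder per-run accuracies (same seeds and splits for both methods).
baseline_runs = [0.80, 0.81, 0.79, 0.80, 0.82]
candidate_runs = [0.86, 0.88, 0.85, 0.87, 0.88]

baseline_mean, baseline_std = summarize(baseline_runs)
candidate_mean, candidate_std = summarize(candidate_runs)
p_value = paired_sign_flip_p(candidate_runs, baseline_runs)
```

With only five paired runs, the smallest two-sided p-value this test can return is 2/32 = 0.0625, which is what the placeholder data yields. More runs, or a per-image test such as McNemar's, would be needed to reach conventional significance thresholds.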

Circularity Check

0 steps flagged

No circularity; empirical benchmarking on held-out test sets with independent VFM features


The paper performs an empirical benchmark of out-of-the-box VFM patch tokens across multiple models and proposes a simple classifier-head redesign (TAP) whose parameters are trained on the detection task. All reported gains are measured on held-out test splits of standard AIGI benchmarks; no equation, prediction, or central claim reduces by construction to a fitted parameter, self-citation, or input definition. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pretrained VFM patch tokens already contain forgery cues and that a small attention pooling head can extract them without backbone fine-tuning. No new physical entities or ad-hoc constants are introduced.

axioms (1)
  • Domain assumption: Pretrained vision foundation models produce patch tokens whose statistics differ between real and AI-generated images.
    Invoked when the authors treat the VFMs as fixed feature extractors.


