TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

Ahmed Abdullah; Nikolas Ebert; Oliver Wasenmüller

arxiv: 2604.26772 · v1 · submitted 2026-04-29 · 💻 cs.CV

Pith reviewed 2026-05-07 13:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated image detection · Vision foundation models · Tunable attention pooling · Image forensics · Patch tokens · AIGI detection · Deepfake detection

The pith

Tunable attention pooling on modern vision foundation models boosts AI-generated image detection to new state-of-the-art.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks multiple families of vision foundation models for detecting AI-generated images and finds that newer models deliver stronger out-of-the-box features than the original CLIP vision transformer. To exploit the patch tokens these models produce, the authors replace the standard classifier head with tunable attention pooling, a lightweight mechanism that aggregates the tokens into a more refined global representation. When combined with the strongest VFMs, the change produces large accuracy gains on several benchmarks and reaches new state-of-the-art performance on two challenging in-the-wild tests that include both fully generated and inpainted images. A reader would care because the approach offers a practical, low-cost upgrade to existing forensic tools that must keep pace with rapidly advancing generative models.

Core claim

Out-of-the-box features from recent vision foundation models outperform those from the original CLIP-ViT for AI-generated image detection. A simple tunable attention pooling head that aggregates the model's output tokens into a refined global representation yields further substantial gains and establishes a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and inpainted images.

What carries the argument

Tunable attention pooling (TAP), a redesign of the classifier head that aggregates output tokens from the vision foundation model into a refined global representation.
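
The summary describes TAP only at a high level, and the referee report notes that its exact formulation (number of heads, learnable parameters, initialization) is not specified. As a rough illustration of attention pooling in general, the Python sketch below assumes a single learnable query vector and single-head scaled dot-product attention. The names `attention_pool`, `mean_pool`, and `query`, and the toy sizes (16 tokens of dimension 8), are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, query):
    """Pool N patch tokens (each of dimension D) into one D-dim vector.

    `query` stands in for a learnable D-dim vector. This is an assumption
    of the sketch, not the paper's stated parameterization.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    # Scaled dot-product score between the query and every token.
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    # Weighted sum of tokens gives the pooled global representation.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def mean_pool(tokens):
    """Baseline for comparison: uniform average over tokens."""
    n = len(tokens)
    return [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]


rng = random.Random(0)
# Random stand-ins for patch tokens: 16 patches, dimension 8.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]

# A zero query scores every token equally, so pooling reduces to the mean.
pooled_uniform, w_uniform = attention_pool(tokens, [0.0] * 8)

# A non-zero query weights some tokens more than others.
query = [rng.gauss(0.0, 1.0) for _ in range(8)]
pooled, weights = attention_pool(tokens, query)
```

With a zero query the head behaves like plain average pooling, so training the query is what lets the pooling emphasize particular patches. Whether the paper's head works this way is not confirmed by the text above.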

If this is right

The best out-of-the-box VFM outperforms the original CLIP by more than 12% in accuracy.
The approach surpasses prior established methods across multiple AIGI detection benchmarks.
TAP combined with the latest VFMs sets new state-of-the-art results on two in-the-wild benchmarks for both fully generated and inpainted AI images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same TAP head could be attached to other downstream tasks that rely on patch-token representations from VFMs.
Newer VFMs trained with different objectives may need even less adaptation when paired with attention-based pooling for forensic use.
Direct comparison on images from entirely post-VFM generative architectures would test the claimed generalization.

Load-bearing premise

That features from the tested vision foundation models will remain discriminative for images produced by future unseen generative models and that the chosen benchmarks adequately capture real-world distribution shifts.

What would settle it

A new benchmark of images from a generative model released after the VFMs were trained on which the TAP-augmented detector shows no meaningful accuracy gain over the original CLIP baseline.

Figures

Figures reproduced from arXiv: 2604.26772 by Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller.

**Figure 1.** Performance of our approach in terms of individual and …

**Figure 2.** t-SNE analysis of the feature spaces of three pretrained foundational ViT variants: CLIP-ViT-L/14 [ …

**Figure 3.** Method Overview. Classical semantic feature extraction …

Original abstract

Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Newer VFMs with a TAP head lift AIGI detection numbers over CLIP baselines on existing benchmarks, but the gains rest on how well those benchmarks represent future generators.

The main thing to know is that swapping to stronger vision foundation models and adding a tunable attention pooling head on their patch tokens improves accuracy for spotting fully generated and inpainted AI images compared with the usual CLIP setups. The paper also runs a wider comparison across VFM families than most prior work has done. They evaluate out-of-the-box features from models with different pretraining, resolutions, and sizes, then show that the best ones already beat earlier detectors by a clear margin before the TAP change is even applied. The head itself is a light redesign that pools the output tokens into a better global descriptor, and the combined system reaches new reported highs on two in-the-wild benchmarks. That part is straightforward and potentially useful for anyone who wants a quick upgrade to existing forensic pipelines without retraining large backbones.

The soft spot is the generalization story. The central claim depends on the chosen benchmarks capturing the distribution shifts that will come from the next round of generators, yet the work does not describe dedicated hold-out tests on entirely new architectures or heavy post-processing that have tripped up earlier detectors. Without those, it is hard to separate real robustness from performance tuned to the current test suites. The experiments appear empirical and non-circular, with performance measured on held-out data, but the abstract-level numbers need the full splits, variance, and any statistical checks to be convincing.

This paper is aimed at people building practical detection tools who follow the VFM literature. A reader looking for a concrete, reproducible head modification and a multi-model comparison will find value in it. It deserves peer review because the core experiment is simple enough to check and the comparison is broad enough to be informative, even if reviewers will probably push for stronger shift tests.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks multiple families of vision foundation models (VFMs) as out-of-the-box feature extractors for detecting fully AI-generated images and AI-inpainted images. It reports that the strongest VFM exceeds the original CLIP-ViT by more than 12% accuracy and, after replacing the classifier head with tunable attention pooling (TAP) over patch tokens, establishes new state-of-the-art results on two in-the-wild AIGI/inpainting benchmarks.

Significance. If the empirical numbers hold, the work shows that architectural and training advances in recent VFMs can be directly exploited for AIGI detection with only a lightweight, tunable pooling head. The comprehensive cross-VFM evaluation and the simple TAP redesign could become a useful baseline for future forensics research.

major comments (2)

[Abstract and experimental protocol] The central generalization claim (features from current VFMs plus TAP remain discriminative for future unseen generators) is load-bearing for the 'in-the-wild' SOTA assertion, yet the described protocol only evaluates on existing benchmarks without an explicit hold-out generator family, architecture-shift, or post-processing-shift test. This matches the known rapid degradation pattern in the AIGI literature.
[Results and experimental setup] Soundness of the reported accuracy gains cannot be assessed without the full experimental details: data splits, number of runs, statistical tests, and whether any post-hoc model selection occurred across the VFM suite.
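
The hold-out generator test requested in the first major comment can be pictured as a leave-one-generator-out protocol. The Python sketch below is one possible way to build such splits, not the paper's protocol. The sample fields (`path`, `label`, `generator`) and the generator names `gen_a`, `gen_b`, `gen_c` are hypothetical placeholders.

```python
from collections import defaultdict


def leave_one_generator_out(samples):
    """Yield (held_out, train, test) splits, one per generator.

    Each sample is a dict with hypothetical keys "path", "label"
    ("real" or "fake"), and "generator" (None for real images).
    """
    by_generator = defaultdict(list)
    reals = []
    for s in samples:
        if s["label"] == "real":
            reals.append(s)
        else:
            by_generator[s["generator"]].append(s)
    # Alternate real images between train and test so the two sides stay disjoint.
    real_train, real_test = reals[0::2], reals[1::2]
    for held_out in sorted(by_generator):
        # Training fakes come from every generator except the held-out one.
        train = real_train + [
            s for g, items in by_generator.items() if g != held_out for s in items
        ]
        # Test fakes come only from the held-out generator.
        test = real_test + by_generator[held_out]
        yield held_out, train, test


samples = [{"path": f"real_{i}.png", "label": "real", "generator": None} for i in range(4)]
for name in ("gen_a", "gen_b", "gen_c"):
    samples += [
        {"path": f"{name}_{i}.png", "label": "fake", "generator": name} for i in range(2)
    ]
splits = list(leave_one_generator_out(samples))
```

Each split trains without any image from the held-out generator, so the test accuracy on that split measures transfer to a generator the head never saw. Architecture-shift and post-processing-shift tests would need additional grouping keys beyond this sketch.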

minor comments (2)

[Section 3] Clarify the exact list of VFMs, their input resolutions, and pretraining objectives in a single table for reproducibility.
[Method] Specify the precise formulation of TAP (number of attention heads, learnable parameters, initialization) and whether it is trained from scratch or fine-tuned.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our generalization claims and the need for fuller experimental details. We address each major comment below and will revise the manuscript to incorporate clarifications and additional experiments.

Referee: [Abstract and experimental protocol] The central generalization claim (features from current VFMs plus TAP remain discriminative for future unseen generators) is load-bearing for the 'in-the-wild' SOTA assertion, yet the described protocol only evaluates on existing benchmarks without an explicit hold-out generator family, architecture-shift, or post-processing-shift test. This matches the known rapid degradation pattern in the AIGI literature.

Authors: We agree that an explicit hold-out test for future generators would strengthen the generalization claim. While our benchmarks already incorporate images from diverse, unseen generative models (as the VFMs were pretrained primarily on real data), we will add a new experiment in the revised manuscript using a recent hold-out generator family (e.g., a post-2023 diffusion model excluded from the original benchmarks) along with architecture-shift and post-processing tests to directly validate the claim.

revision: yes

Referee: [Results and experimental setup] Soundness of the reported accuracy gains cannot be assessed without the full experimental details: data splits, number of runs, statistical tests, and whether any post-hoc model selection occurred across the VFM suite.

Authors: We concur that complete experimental details are necessary for assessing the results. In the revised manuscript, we will expand the experimental section and add an appendix with: precise data splits for each benchmark, the number of independent runs with mean/std, applied statistical tests (e.g., significance testing for accuracy differences), and explicit confirmation that all VFMs were evaluated uniformly without post-hoc selection.

revision: yes
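
The reporting promised in this response (mean/std over runs, significance testing for accuracy differences) could look roughly like the Python sketch below. It assumes an exact two-sided McNemar test on paired per-image correctness as the significance test, which is one common choice and not something the rebuttal specifies. The function names and all numbers are illustrative placeholders.

```python
import math
import statistics


def summarize_runs(accuracies):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-image correctness flags."""
    # Count only the discordant pairs, where exactly one detector is right.
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial tail with p = 0.5 under the null hypothesis, doubled for two sides.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Placeholder numbers for illustration only; they are not results from the paper.
run_accuracies = [0.90, 0.92, 0.91]
mean_acc, std_acc = summarize_runs(run_accuracies)

correct_a = [True] * 10               # detector A is right on all 10 images
correct_b = [True] * 4 + [False] * 6  # detector B is right on the first 4 only
p_value = mcnemar_exact(correct_a, correct_b)
```

A paired test fits here because both detectors are scored on the same test images, so only the images where they disagree carry information about which one is better.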

Circularity Check

0 steps flagged

No circularity; empirical benchmarking on held-out test sets with independent VFM features

The paper performs an empirical benchmark of out-of-the-box VFM patch tokens across multiple models and proposes a simple classifier-head redesign (TAP) whose parameters are trained on the detection task. All reported gains are measured on held-out test splits of standard AIGI benchmarks; no equation, prediction, or central claim reduces by construction to a fitted parameter, self-citation, or input definition. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pretrained VFM patch tokens already contain forgery cues and that a small attention pooling head can extract them without backbone fine-tuning. No new physical entities or ad-hoc constants are introduced.

axioms (1)

Domain assumption: Pretrained vision foundation models produce patch tokens whose statistics differ between real and AI-generated images.
Invoked when the authors treat the VFMs as fixed feature extractors.
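
To make the "fixed feature extractor" assumption concrete, the Python sketch below trains only a small logistic-regression head on features that are never updated. The 2-D vectors are synthetic stand-ins for pooled backbone features, and `train_linear_probe` and `predict` are hypothetical names. It illustrates the general frozen-backbone setup, not the authors' training recipe.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_linear_probe(features, labels, lr=0.5, steps=200):
    """Fit logistic regression on fixed features with batch gradient descent.

    Only the head (weights, bias) is updated; the features themselves are
    never modified, mirroring a frozen backbone.
    """
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Gradient of the mean cross-entropy loss w.r.t. the logit.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for d in range(dim):
                grad_w[d] += err * x[d] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (AI-generated) if the predicted probability exceeds 0.5."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) > 0.5)


# Synthetic 2-D stand-ins for pooled backbone features (0 = real, 1 = generated).
features = [[-1.0, -0.8], [-0.9, -1.2], [-1.1, -1.0], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_linear_probe(features, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(features, labels)) / len(labels)
```

If the axiom holds, real and generated images are already separable in the frozen feature space, and a lightweight head like this is enough to find the boundary. If it fails for a new generator, no amount of head training on these features would recover the signal.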

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 3 internal anchors

[1]

Dualsight: Learning to disentangle artifact and semantic fea- tures for detection of diffusion-generated images

Ahmed Abdullah, Nikolas Ebert, and Oliver Wasenm ¨uller. Dualsight: Learning to disentangle artifact and semantic fea- tures for detection of diffusion-generated images. InInter- national Conference on Pattern Recognition (ICPR), 2026. 3, 6

work page 2026
[2]

Flexivit: One model for all patch sizes

Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. Flexivit: One model for all patch sizes. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2, 4

work page 2023
[3]

Megalith-10m: A dataset of public domain photographs.https://huggingface.co/ datasets / madebyollin / megalith - 10m, 2024

Ollin Boer Bohan. Megalith-10m: A dataset of public domain photographs.https://huggingface.co/ datasets / madebyollin / megalith - 10m, 2024. Accessed: 2026-02-26. 5

work page 2024
[4]

Perception encoder: The best visual embeddings are not at the output of the net- work

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the net- work. InNeural Information Processing Systems (NeurIPS),

work page
[5]

Image manipulation detection by multi-view multi-scale supervision

Xinru Chen, Chengbo Dong, Jiaqi Ji, Juan Cao, and Xirong Li. Image manipulation detection by multi-view multi-scale supervision. InInternational Conference on Computer Vi- sion (ICCV), 2021. 6

work page 2021
[6]

Xception: Deep learning with depthwise separable convolutions

Franc ¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. InConference on Computer Vision and Pattern Recognition (CVPR), 2017. 5

work page 2017
[7]

Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdul- mohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. InNeural Information Pro- cessing Systems (NeurIPS), 2023. 2, 4

work page 2023
[8]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InConference on Computer Vision and Pattern Recognition (CVPR), 2009. 4, 5

work page 2009
[9]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. InInternational Con- ference on Learning Representations (ICLR), 2021. 2, 4

work page 2021
[10]

PLG-ViT: Vision transformer with parallel local and global self-attention.Sensors, 23(7):3447, 2023

Nikolas Ebert, Didier Stricker, and Oliver Wasenm ¨uller. PLG-ViT: Vision transformer with parallel local and global self-attention.Sensors, 23(7):3447, 2023. 3

work page 2023
[11]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML),

work page
[12]

Leveraging fre- quency analysis for deep fake image recognition

Joel Frank, Thorsten Eisenhofer, Lea Sch ¨onherr, Asja Fis- cher, Dorothea Kolossa, and Thorsten Holz. Leveraging fre- quency analysis for deep fake image recognition. InInterna- tional Conference on Machine Learning (ICML), 2020. 2

work page 2020
[13]

Trufor: Leveraging all-round clues for trustworthy image forgery detection and localiza- tion

Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localiza- tion. InConference on Computer Vision and Pattern Recog- nition (CVPR), 2023. 6

work page 2023
[14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InConference on Computer Vision and Pattern Recognition (CVPR), 2016. 3, 5

work page 2016
[15]

Masked autoencoders are scal- able vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scal- able vision learners. InConference on Computer Vision and Pattern Recognition (CVPR), 2022. 2

work page 2022
[16]

Lora: Low- rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low- rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR), 2022. 3

work page 2022
[17]

Progressive growing of gans for improved quality, stability, and variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. InInternational Conference on Learning Rep- resentations (ICLR), 2018. 3

work page 2018
[18]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InConference on Computer Vision and Pattern Recognition (CVPR), 2019. 3

work page 2019
[19]

Leveraging rep- resentations from intermediate encoder-blocks for synthetic image detection

Christos Koutlis and Symeon Papadopoulos. Leveraging rep- resentations from intermediate encoder-blocks for synthetic image detection. InEuropean Conference on Computer Vi- sion (ECCV), 2024. 2, 3, 6, 7

work page 2024
[20]

Learning jpeg compression artifacts for image manipulation detection and localization

Myung-Joon Kwon, Seung-Hun Nam, In-Jae Yu, Heung- Kyu Lee, and Changick Kim. Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision (IJCV), 2022. 6

work page 2022
[21]

Flux.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 1, 3

work page 2024
[22]

FLUX.2: Frontier Visual Intelligence

Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025. 1, 3

work page 2025
[23]

Detecting generated images by real im- ages

Bo Liu, Fan Yang, Xiuli Bi, Bin Xiao, Weisheng Li, and Xinbo Gao. Detecting generated images by real im- ages. InEuropean Conference on Computer Vision (ECCV). Springer, 2022. 6

work page 2022
[24]

Spatial- phase shallow learning: rethinking face forgery detection in frequency domain

Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial- phase shallow learning: rethinking face forgery detection in frequency domain. InConference on Computer Vision and Pattern Recognition (CVPR), 2021. 5

work page 2021
[25]

Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization.IEEE Trans- actions on Circuits and Systems for Video Technology, 2022

Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization.IEEE Trans- actions on Circuits and Systems for Video Technology, 2022. 6

work page 2022
[26]

Global tex- ture enhancement for fake face detection in the wild

Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. Global tex- ture enhancement for fake face detection in the wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 5, 6

work page 2020
[27]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 5

work page 2021
[28]

Gener- alizing face forgery detection with high-frequency features

Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Gener- alizing face forgery detection with high-frequency features. InConference on Computer Vision and Pattern Recognition (CVPR), pages 16317–16326, 2021. 5

work page 2021
[29]

arXiv preprint arXiv:2307.14863 (2023)

Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y Al Ham- madi, and Jizhe Zhou. Iml-vit: Benchmarking image ma- nipulation localization by vision transformer.arXiv preprint arXiv:2307.14863, 2023. 6

work page arXiv 2023
[30]

Towards uni- versal fake image detectors that generalize across genera- tive models

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across genera- tive models. InConference on Computer Vision and Pattern Recognition (CVPR), 2023. 1, 2, 3, 5, 6, 7

work page 2023
[31]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 4, 7

work page internal anchor Pith review arXiv 2023
[32]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision (ICCV), 2023. 3

work page 2023
[33]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InInternational Con- ference on Learning Representations (ICLR), 2024. 1

work page 2024
[34]

Thinking in frequency: Face forgery detection by mining frequency-aware clues

Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. InEuropean Conference on Computer Vision (ECCV). Springer, 2020. 5

work page 2020
[35]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational Conference on Machine Learning (ICML), 2021. 2, 3, 4, 7, 8

work page 2021
[36]

Sam 2: Seg- ment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Seg- ment anything in images and videos. InInternational Con- ference on Learning Representations (ICLR), 2025. 4, 7

work page 2025
[37]

Gen- erating diverse high-fidelity images with vq-vae-2.Neural Information Processing Systems (NeurIPS), 2019

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Gen- erating diverse high-fidelity images with vq-vae-2.Neural Information Processing Systems (NeurIPS), 2019. 3

work page 2019
[38]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2022. 1, 3, 5

work page 2022
[39]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 4, 7

work page internal anchor Pith review arXiv 2025
[40]

De- clip: Decoding clip representations for deepfake localization

Stefan Smeu, Elisabeta Oneata, and Dan Oneata. De- clip: Decoding clip representations for deepfake localization. InWinter Conference on Applications of Computer Vision (WACV), 2025. 3, 6

work page 2025
[41]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page
[42]

Frequency-aware deepfake de- tection: Improving generalizability through frequency space domain learning

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake de- tection: Improving generalizability through frequency space domain learning. InAAAI conference on Artificial Intelli- gence, 2024. 2, 6

work page 2024
[43]

Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. InConference on Computer Vision and Pattern Recognition (CVPR), 2024. 1, 2, 3, 5, 6

work page 2024
[44]

C2p-clip: Inject- ing category common prompt in clip to enhance generaliza- tion in deepfake detection

Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Inject- ing category common prompt in clip to enhance generaliza- tion in deepfake detection. InAAAI conference on Artificial Intelligence, 2025. 2, 3

work page 2025
[45]

Training data-efficient image transformers & distillation through at- tention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv ´e J´egou. Training data-efficient image transformers & distillation through at- tention. InInternational Conference on Machine Learning (ICML), 2021. 5

work page 2021
[46]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2, 4, 7, 8

work page internal anchor Pith review arXiv 2025
[47]

Neural discrete representation learning.Neural Information Processing Sys- tems (NeurIPS), 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Neural Information Processing Sys- tems (NeurIPS), 2017. 3

work page 2017
[48]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeural Information Processing Systems (NeurIPS), 2017. 4

work page 2017
[49]

Ob- jectformer for image manipulation detection and localiza- tion

Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Ab- hinav Shrivastava, Ser-Nam Lim, and Yu-Gang Jiang. Ob- jectformer for image manipulation detection and localiza- tion. InConference on Computer Vision and Pattern Recog- nition (CVPR), 2022. 6

work page 2022
[50]

Cnn-generated images are sur- prisingly easy to spot

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are sur- prisingly easy to spot... for now. InConference on Computer Vision and Pattern Recognition (CVPR), 2020. 1, 2, 3, 5, 6

work page 2020
[51]

Opensdi: Spotting diffusion-generated images in the open world

Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. Opensdi: Spotting diffusion-generated images in the open world. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1, 2, 3, 4, 6, 7, 8

work page 2025
[52]

Dire for diffusion-generated image detection

Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. InInternational Con- ference on Computer Vision (ICCV, 2023. 5, 6

work page 2023
[53]

A sanity check for ai- generated image detection

Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai- generated image detection. InInternational Conference on Learning Representations (ICLR), 2025. 1, 3, 4, 5, 6, 7

work page 2025
[54]

Deepfake detection that generalizes across benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, and Mario Fritz. Deepfake detection that generalizes across benchmarks. In Winter Conference on Applications of Computer Vision (WACV), 2026. 2

work page 2026
[55]

Low-rank few- shot adaptation of vision-language models

Maxime Zanella and Ismail Ben Ayed. Low-rank few- shot adaptation of vision-language models. InConference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024. 3

work page 2024
[56]

Scaling vision transformers

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lu- cas Beyer. Scaling vision transformers. InConference on Computer Vision and Pattern Recognition (CVPR), 2022. 4

work page 2022
[57]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InIn- ternational Conference on Computer Vision (ICCV), 2023. 1

work page 2023
[58]

Detect- ing and simulating artifacts in gan fake images

Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detect- ing and simulating artifacts in gan fake images. InIEEE in- ternational workshop on information forensics and security (WIFS), 2019. 3

work page 2019
[59]

Detect- ing and simulating artifacts in gan fake images

Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detect- ing and simulating artifacts in gan fake images. InInter- national Workshop on Information Forensics and Security (WIFS). IEEE, 2019. 5

work page 2019
[60]

Patchcraft: Exploring texture patch for efficient ai-generated image detection

Nan Zhong, Yiran Xu, Sheng Li, Zhenxing Qian, and Xinpeng Zhang. Patchcraft: Exploring texture patch for efficient ai-generated image detection.arXiv preprint arXiv:2311.12397, 2023. 5, 6

[61] Yue Zhou, Xinan He, KaiQing Lin, Bin Fan, Feng Ding, and Bin Li. Breaking latent prior bias in detectors for generalizable aigc image detection. In Neural Information Processing Systems (NeurIPS), 2025.

[62] Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Jinhua Zeng, and Bin Li. Brought a gun to a knife fight: Modern vfm baselines outgun specialized detectors on in-the-wild ai image detection. arXiv preprint arXiv:2509.12995, 2025.

[63] Mingjian Zhu, Hanting Chen, Mouxiao Huang, Wei Li, Hailin Hu, Jie Hu, and Yunhe Wang. Gendet: Towards good generalizations for ai-generated image detection. arXiv preprint arXiv:2312.08880, 2023.

[64] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. In Neural Information Processing Systems (NeurIPS), 2023.

