TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection
Pith reviewed 2026-05-07 13:38 UTC · model grok-4.3
The pith
Tunable attention pooling on modern vision foundation models raises AI-generated image detection to a new state of the art.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Out-of-the-box features from recent vision foundation models outperform those from the original CLIP-ViT for AI-generated image detection. A simple tunable attention pooling head that aggregates the model's output tokens into a refined global representation yields further substantial gains and establishes a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and inpainted images.
What carries the argument
Tunable attention pooling (TAP), a redesign of the classifier head that aggregates output tokens from the vision foundation model into a refined global representation.
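To make the idea concrete, here is a minimal sketch of attention pooling over output tokens, assuming a single learnable query vector and single-head scaled dot-product attention. The exact TAP formulation (number of heads, projections, initialization) is not given in this summary, so the function name `attention_pool`, the `query` parameter, and the toy token values are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(tokens, query):
    """Pool a list of token vectors into one vector.

    tokens: list of N vectors (each a list of D floats), e.g. VFM output tokens.
    query:  list of D floats; in a trained head this would be a learnable
            parameter (an assumption of this sketch).
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    # Scaled dot-product score between the query and every token.
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    # Weighted sum of tokens gives the pooled global representation.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Toy example: three 2-D "patch tokens" and a hypothetical query.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query yields uniform weights, i.e. mean pooling
pooled, weights = attention_pool(tokens, query)
```

In this sketch a zero query reduces to plain mean pooling, while a trained query can weight some tokens more heavily than others; a linear classifier on `pooled` would then produce the real-versus-generated score.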
If this is right
- The best VFM outperforms the original CLIP by more than 12% in accuracy when used out of the box.
- The approach surpasses established methods across multiple AIGI detection benchmarks.
- Combining TAP with the latest VFMs sets new state-of-the-art results on two in-the-wild benchmarks covering both fully generated and inpainted AI images.
Where Pith is reading between the lines
- The same TAP head could be attached to other downstream tasks that rely on patch-token representations from VFMs.
- Newer VFMs trained with different objectives may need even less adaptation when paired with attention-based pooling for forensic use.
- Direct comparison on images from entirely post-VFM generative architectures would test the claimed generalization.
Load-bearing premise
That features from the tested vision foundation models will remain discriminative for images produced by future unseen generative models and that the chosen benchmarks adequately capture real-world distribution shifts.
What would settle it
A new benchmark of images from a generative model released after the VFMs were trained on which the TAP-augmented detector shows no meaningful accuracy gain over the original CLIP baseline.
Original abstract
Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks multiple families of vision foundation models (VFMs) as out-of-the-box feature extractors for detecting fully AI-generated images and AI-inpainted images. It reports that the strongest VFM exceeds the original CLIP-ViT by more than 12% accuracy and, after replacing the classifier head with tunable attention pooling (TAP) over patch tokens, establishes new state-of-the-art results on two in-the-wild AIGI/inpainting benchmarks.
Significance. If the empirical numbers hold, the work shows that architectural and training advances in recent VFMs can be directly exploited for AIGI detection with only a lightweight, tunable pooling head. The comprehensive cross-VFM evaluation and the simple TAP redesign could become a useful baseline for future forensics research.
major comments (2)
- [Abstract and experimental protocol] The central generalization claim (features from current VFMs plus TAP remain discriminative for future unseen generators) is load-bearing for the 'in-the-wild' SOTA assertion, yet the described protocol only evaluates on existing benchmarks without an explicit hold-out generator family, architecture-shift, or post-processing-shift test. This matches the known rapid degradation pattern in the AIGI literature.
- [Results and experimental setup] Soundness of the reported accuracy gains cannot be assessed without the full experimental details: data splits, number of runs, statistical tests, and whether any post-hoc model selection occurred across the VFM suite.
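The hold-out protocol requested in the first major comment can be sketched as a leave-one-generator-out split. This is a minimal illustration under assumed data: the sample records, the `generator` field, and the generator names below are hypothetical and are not taken from the paper's benchmarks.

```python
def leave_one_generator_out(samples, held_out):
    """Split samples so one generator family never appears in training.

    samples:  list of dicts with keys "path", "label" (0 = real, 1 = generated)
              and "generator" (None for real images).
    held_out: name of the generator family reserved for testing.
    Real images are alternated between the two sides so both contain negatives.
    """
    train, test = [], []
    real_seen = 0
    for s in samples:
        if s["generator"] is None:
            # Alternate real images between train and test.
            (train if real_seen % 2 == 0 else test).append(s)
            real_seen += 1
        elif s["generator"] == held_out:
            test.append(s)
        else:
            train.append(s)
    return train, test

# Hypothetical toy dataset.
samples = [
    {"path": "r0.png", "label": 0, "generator": None},
    {"path": "r1.png", "label": 0, "generator": None},
    {"path": "a0.png", "label": 1, "generator": "gen_a"},
    {"path": "b0.png", "label": 1, "generator": "gen_b"},
    {"path": "b1.png", "label": 1, "generator": "gen_b"},
]
train, test = leave_one_generator_out(samples, held_out="gen_b")
```

Repeating the split once per generator family and reporting accuracy on each held-out family would show how far the detector generalizes beyond the generators it was trained on.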
minor comments (2)
- [Section 3] Clarify the exact list of VFMs, their input resolutions, and pretraining objectives in a single table for reproducibility.
- [Method] Specify the precise formulation of TAP (number of attention heads, learnable parameters, initialization) and whether it is trained from scratch or fine-tuned.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our generalization claims and the need for fuller experimental details. We address each major comment below and will revise the manuscript to incorporate clarifications and additional experiments.
Point-by-point responses
- Referee: [Abstract and experimental protocol] The central generalization claim (features from current VFMs plus TAP remain discriminative for future unseen generators) is load-bearing for the 'in-the-wild' SOTA assertion, yet the described protocol only evaluates on existing benchmarks without an explicit hold-out generator family, architecture-shift, or post-processing-shift test. This matches the known rapid degradation pattern in the AIGI literature.
  Authors: We agree that an explicit hold-out test for future generators would strengthen the generalization claim. While our benchmarks already incorporate images from diverse, unseen generative models (as the VFMs were pretrained primarily on real data), we will add a new experiment in the revised manuscript using a recent hold-out generator family (e.g., a post-2023 diffusion model excluded from the original benchmarks) along with architecture-shift and post-processing tests to directly validate the claim.
  Revision: yes
- Referee: [Results and experimental setup] Soundness of the reported accuracy gains cannot be assessed without the full experimental details: data splits, number of runs, statistical tests, and whether any post-hoc model selection occurred across the VFM suite.
  Authors: We concur that complete experimental details are necessary for assessing the results. In the revised manuscript, we will expand the experimental section and add an appendix with: precise data splits for each benchmark, the number of independent runs with mean/std, applied statistical tests (e.g., significance testing for accuracy differences), and explicit confirmation that all VFMs were evaluated uniformly without post-hoc selection.
  Revision: yes
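The mean/std reporting and significance testing promised in the second response could look like the following sketch. It assumes per-run accuracies and per-image correctness flags are available. The exact McNemar test is one common choice for comparing two classifiers on the same test set and is an assumption here, since the authors do not name the test they will use; all numbers are hypothetical.

```python
import math
import statistics

def summarize_runs(accuracies):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test for two detectors on the same test images.

    correct_a, correct_b: lists of booleans, one entry per test image.
    Returns (b, c, p_value), where b counts images only detector A gets right
    and c counts images only detector B gets right.
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if bb and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the detectors never disagree
    # Binomial tail probability under the null that disagreements are 50/50.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)

# Hypothetical numbers for illustration only.
mean_acc, std_acc = summarize_runs([0.90, 0.92, 0.91])
b, c, p = mcnemar_exact(
    [True, True, True, True, True, True, False, True],
    [False, False, False, False, False, False, True, True],
)
```

Reporting the mean and standard deviation over independent runs alongside a paired test on the shared test set separates genuine accuracy differences from run-to-run noise.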
Circularity Check
No circularity; empirical benchmarking on held-out test sets with independent VFM features
full rationale
The paper performs an empirical benchmark of out-of-the-box VFM patch tokens across multiple models and proposes a simple classifier-head redesign (TAP) whose parameters are trained on the detection task. All reported gains are measured on held-out test splits of standard AIGI benchmarks; no equation, prediction, or central claim reduces by construction to a fitted parameter, self-citation, or input definition. The derivation chain is therefore self-contained against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pretrained vision foundation models produce patch tokens whose statistics differ between real and AI-generated images.