Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

Bing Fan; Bin Li; Feng Ding; Kaiqing Lin; Xinan He; Yue Zhou

arxiv: 2602.01738 · v2 · submitted 2026-02-02 · 💻 cs.CV

Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Bin Li

Pith reviewed 2026-05-16 08:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated image detection · vision foundation models · linear classifier · generalization · in-the-wild evaluation · emergent capabilities · forensic features

The pith

A simple linear classifier on frozen features from vision foundation models detects AI-generated images far better than specialized detectors in real-world conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that specialized AI-generated image detectors collapse outside controlled benchmarks while a basic linear probe on frozen features from models such as Perception Encoder, MetaCLIP 2, and DINOv3 matches those detectors on standard tests and exceeds them by over 30 percent on in-the-wild data. This capability is presented as an emergent result of pre-training on large corpora that include synthetic images. Vision-language models are said to learn an explicit semantic notion of forgery while self-supervised models pick up implicit forensic cues. The work also identifies clear remaining weaknesses, including drops under recapture or transmission and blindness to VAE reconstruction and localized edits, and argues for shifting forensics toward leveraging foundation model representations instead of building narrow detectors.

Core claim

A simple linear classifier trained on the frozen features of modern Vision Foundation Models, including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art for AIGI detection. The approach matches specialized detectors on traditional benchmarks yet outperforms them by wide margins on challenging in-the-wild distributions. The authors attribute this to the models' exposure to synthetic content during pre-training, with vision-language models internalizing an explicit semantic concept of forgery and self-supervised models implicitly acquiring discriminative forensic features.

What carries the argument

Linear classifier trained on frozen features from vision foundation models such as Perception Encoder, MetaCLIP 2, and DINOv3.
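A minimal sketch of what such a probe can look like, assuming the backbone embeddings have already been extracted and cached. Everything here is illustrative rather than the authors' implementation: `make_toy_features` is a hypothetical stand-in that draws two Gaussian clusters in place of real Perception Encoder, MetaCLIP 2, or DINOv3 features, and the L2 normalization, learning rate, step count, and weight decay are assumptions.

```python
import numpy as np


def l2_normalize(feats, eps=1e-12):
    """Scale every feature vector to unit L2 norm (an assumed preprocessing step)."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)


def train_linear_probe(feats, labels, lr=0.5, steps=500, weight_decay=1e-4):
    """Fit one weight vector and a bias with logistic loss and plain gradient descent."""
    n_samples, dim = feats.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        err = probs - labels  # derivative of the log-loss w.r.t. the logits
        w -= lr * (feats.T @ err / n_samples + weight_decay * w)
        b -= lr * err.mean()
    return w, b


def predict(feats, w, b):
    """Label 1 means 'AI-generated', label 0 means 'real'."""
    return (feats @ w + b > 0).astype(int)


def make_toy_features(rng, n_samples, dim=64, shift=3.0):
    """Hypothetical stand-in for frozen backbone embeddings: two shifted Gaussian clouds."""
    labels = rng.integers(0, 2, size=n_samples)
    direction = np.ones(dim) / np.sqrt(dim)
    feats = rng.normal(size=(n_samples, dim))
    feats += (2 * labels - 1)[:, None] * shift * direction
    return feats, labels


rng = np.random.default_rng(0)
train_x, train_y = make_toy_features(rng, 2000)
test_x, test_y = make_toy_features(rng, 1000)

# The backbone is never updated: only w and b are learned.
w, b = train_linear_probe(l2_normalize(train_x), train_y)
test_acc = float((predict(l2_normalize(test_x), w, b) == test_y).mean())
print(f"toy test accuracy: {test_acc:.3f}")
```

In a real run the toy generator would be replaced by features computed once from the frozen encoder, which is what keeps this baseline cheap compared with training a specialized detector end to end.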

If this is right

The linear probe matches specialized detectors on curated benchmarks.
It exceeds prior detectors by more than 30 percent accuracy on in-the-wild distributions.
Vision-language models learn an explicit semantic concept of forgery from pre-training.
Self-supervised models acquire implicit discriminative forensic features.
Performance still degrades under recapture, transmission, VAE reconstruction, and localized editing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detectors for other synthetic media may similarly benefit from frozen foundation-model features rather than new task-specific architectures.
Scaling the underlying foundation models further could widen the performance gap on real-world data.
Systematic auditing of pre-training corpora for synthetic content would strengthen the causal account of emergence.

Load-bearing premise

The superior real-world performance arises because the pre-training data for these foundation models already contained synthetic images.

What would settle it

Demonstrating that a vision foundation model pre-trained exclusively on real images yields comparable accuracy on in-the-wild AIGI datasets would falsify the claim.

Figures

Figures reproduced from arXiv: 2602.01738 by Bing Fan, Bin Li, Feng Ding, Kaiqing Lin, Xinan He, Yue Zhou.

**Figure 2.** Robustness to Common Perturbations. Accuracy …

Original abstract

While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models, including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, and remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Linear probes on recent VFMs beat specialized detectors on in-the-wild AIGI data by large margins, but the claim that this emerges from synthetic content in pre-training has no supporting evidence.

The main thing to know is that a linear classifier on frozen features from models like DINOv3, MetaCLIP 2, and Perception Encoder matches specialized AIGI detectors on standard benchmarks and beats them by over 30% on in-the-wild sets. The paper runs this across unseen generators and real-world distributions, and it flags clear failure modes like recapture and localized edits. That combination of broad testing plus honest limits is the useful part. The evaluation setup appears to target the exact problem of benchmark overfitting that has held back deployment in this area. What is new is the scale of the reported gains with these particular recent foundation models rather than older backbones.

The soft spot is the explanation. The authors attribute the capability to massive pre-training data that contained synthetic images, which supposedly let VLMs learn a forgery concept and SSL models learn forensic features. No model card, dataset audit, or citation is given to show that any of the tested models actually saw AI-generated images at scale. Without that check, the performance numbers stand as an observation but the causal story stays untested.

This paper is for people working on practical AIGI detection who want a stronger baseline than the current specialized detectors. A reader who needs numbers on how foundation models behave out of distribution would find the results worth checking. It deserves peer review because the empirical pattern is concrete enough that its details are worth verifying, even if the mechanistic part needs more work.

Referee Report

2 major / 2 minor

Summary. The paper claims that a simple linear classifier trained on frozen features from modern vision foundation models (Perception Encoder, MetaCLIP 2, DINOv3) achieves new state-of-the-art AIGI detection performance. It matches specialized detectors on standard benchmarks but outperforms them by over 30% accuracy on in-the-wild datasets, attributing this to emergent properties from large-scale pre-training on data containing synthetic content (VLMs internalizing semantic forgery concepts; SSL models acquiring forensic features). The work includes evaluation on traditional benchmarks, unseen generators, and in-the-wild data, while noting limitations under recapture, transmission, VAE reconstruction, and localized editing.

Significance. If the reported performance margins hold under rigorous verification, the result would support a shift toward simple probes on foundation-model features for generalizable AIGI detection rather than specialized architectures. The broad evaluation scope across benchmarks and real-world distributions provides a useful empirical baseline, and the identification of specific failure modes (recapture, editing) offers concrete directions for future work.

major comments (2)

[Abstract] The central interpretive claim that superior in-the-wild performance is 'an emergent property driven by the massive scale of pre-training data containing synthetic content' is unsupported. No dataset audits, model-card analyses, or citations are provided to confirm the presence or quantity of AI-generated images in the pre-training corpora of Perception Encoder, MetaCLIP 2, or DINOv3. This assumption is load-bearing for the title and conclusion that 'simplicity prevails' because of foundation-model scale.
[Evaluation sections] (Implied by the abstract's description of 'comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions'.) The >30% accuracy boost on in-the-wild datasets is presented without accompanying details on statistical controls, exact dataset definitions, exclusion criteria, or confidence intervals. Without these, the generalization claim cannot be fully assessed and remains vulnerable to hidden confounds in the in-the-wild splits.

minor comments (2)

[Abstract] The abstract uses 'striking margins of over 30%' without specifying the exact baseline detectors or the precise metric (accuracy, AUC, etc.) for each comparison.
[Methods] Notation for the linear classifier (e.g., whether it is a single-layer probe or includes any normalization) is not introduced in the provided summary, which could be clarified in the methods.
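One way the missing notation could be written down, offered only as an assumed formulation and not as the paper's own: $f_{\theta}$ denotes the frozen backbone, $z$ its feature vector for image $x$, $y_i \in \{0, 1\}$ the real/generated label, and $(w, b)$ the only trainable parameters. The L2 normalization step is an assumption; the summary does not say whether the probe uses it.

```latex
z = f_{\theta}(x), \qquad
\tilde{z} = \frac{z}{\lVert z \rVert_2}, \qquad
\hat{p}(x) = \sigma\!\left(w^{\top}\tilde{z} + b\right), \qquad
\mathcal{L}(w, b) = -\frac{1}{N}\sum_{i=1}^{N}
  \Big[\, y_i \log \hat{p}(x_i) + (1 - y_i)\log\big(1 - \hat{p}(x_i)\big) \Big]
```

Here $\sigma$ is the logistic sigmoid and $\theta$ receives no gradient, which is what makes the classifier a single-layer probe.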

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the interpretive framing and evaluation rigor. We address each point below and will incorporate revisions to improve the manuscript.

Referee: [Abstract] The central interpretive claim that superior in-the-wild performance is 'an emergent property driven by the massive scale of pre-training data containing synthetic content' is unsupported. No dataset audits, model-card analyses, or citations are provided to confirm the presence or quantity of AI-generated images in the pre-training corpora of Perception Encoder, MetaCLIP 2, or DINOv3. This assumption is load-bearing for the title and conclusion that 'simplicity prevails' because of foundation-model scale.

Authors: We agree that the claim would be strengthened by additional grounding. The manuscript presents the explanation as a posited hypothesis inferred from the scale of web-derived pre-training corpora (known to contain synthetic imagery) and the observed performance patterns across model families. We did not perform new dataset audits. In revision we will add citations to prior work documenting synthetic content in large-scale VLM and SSL training data, and we will rephrase the abstract, title-adjacent claims, and conclusion to present the account as a supported hypothesis rather than an asserted fact. This reduces the load-bearing status while preserving the core empirical result that frozen foundation-model features yield strong generalization.

revision: yes

Referee: [Evaluation sections] (Implied by the abstract's description of 'comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions'.) The >30% accuracy boost on in-the-wild datasets is presented without accompanying details on statistical controls, exact dataset definitions, exclusion criteria, or confidence intervals. Without these, the generalization claim cannot be fully assessed and remains vulnerable to hidden confounds in the in-the-wild splits.

Authors: We accept that additional statistical detail will improve verifiability. The full manuscript already defines the in-the-wild datasets and splits in the evaluation sections, but we will expand these to include explicit exclusion criteria, confidence intervals on all reported accuracies, and results from multiple random seeds with standard deviations. These additions will be placed in the main evaluation tables and text to allow readers to assess potential confounds directly.

revision: yes
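The statistical reporting requested in this exchange can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: the Wilson score interval is one common choice for a binomial accuracy, the function names are hypothetical, and the counts and per-seed accuracies are made-up placeholders rather than results from the paper.

```python
import math
import statistics


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy of correct/total (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return center - half, center + half


def summarize_seeds(accuracies):
    """Mean and sample standard deviation of accuracies from repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder numbers for illustration only.
low, high = wilson_interval(correct=50, total=100)
mean_acc, std_acc = summarize_seeds([0.90, 0.92, 0.94])
print(f"accuracy 0.500, 95% CI [{low:.3f}, {high:.3f}]")
print(f"across seeds: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting both quantities separates two sources of uncertainty: the interval reflects the finite size of a test split, while the spread across seeds reflects variation in probe training.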

Circularity Check

0 steps flagged

No circularity: purely empirical results with no derivations or self-referential reductions

The paper reports empirical accuracy numbers from training a linear classifier on frozen features extracted from external Vision Foundation Models (Perception Encoder, MetaCLIP 2, DINOv3). No equations, fitted parameters, or derivation steps exist that could reduce the reported performance gains to a definition or input by construction. The interpretive claim that the capability is 'emergent' from synthetic content in pre-training is an unverified hypothesis rather than a load-bearing derivation; it does not create circularity because the performance numbers stand as direct measurements against external baselines and datasets. No self-citation chains or ansatzes are invoked to justify the core results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central explanation rests on the unverified premise that pre-training corpora contain sufficient synthetic images to produce the observed forensic capability; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Pre-training data of the tested vision foundation models contains synthetic images in sufficient quantity and diversity to induce forgery-discriminative features.
Invoked to explain why the linear probe generalizes; no independent verification supplied in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1386 out tokens · 39882 ms · 2026-05-16T08:38:02.213018+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection
cs.CV 2026-05 conditional novelty 6.0

SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1] Daniel Bolya, Po-Yao Huang, Peize Sun, et al. 2025. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181 (2025).

[2] Lvpan Cai, Haowei Wang, Jiayi Ji, YanShu ZhouMen, Shen Chen, Taiping Yao, and Xiaoshuai Sun. 2025. Zooming in on fakes: A novel dataset for localized AI-generated image detection with forgery amplification approach. arXiv preprint arXiv:2504.11922 (2025).

[3] Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. 2024. Real-time deepfake detection in the real-world. arXiv preprint arXiv:2406.09398 (2024).

[4] Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. 2024. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In Forty-first International Conference on Machine Learning.

[5] Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, and Shouhong Ding. 2025. Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection. arXiv preprint arXiv:2512.06746 (2025).

[6] Ruoxin Chen, Junwei Xi, Zhiyuan Yan, et al. 2025. Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable. arXiv preprint arXiv:2505.14359 (2025).

[7] Xi Chen, Xiao Wang, Soravit Changpinyo, Anthony J Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022).

[8] Yung-Sung Chuang, Yang Li, Dong Wang, et al. 2025. Metaclip 2: A worldwide scaling recipe. arXiv preprint arXiv:2507.22062 (2025).

[9] Davide Cozzolino and Luisa Verdoliva. 2019. Noiseprint: A CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security 15 (2019), 144–159.

[10] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. 2020. Leveraging frequency analysis for deep fake image recognition. In International conference on machine learning. PMLR, 3247–3258.

[11] Yan Ju, Shan Jia, Lipeng Ke, Hongfei Xue, Koki Nagano, and Siwei Lyu. 2022. Fusing global and local features for generalized ai-synthesized image detection. In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 3465–3469.

[12] Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, and Yao Zhu. 2025. Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 20379–20389.

[13, 14] Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1. 2405–2414.

[15] Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. 2025. Is Artificial Intelligence Generated Image Detection a Solved Problem? arXiv preprint arXiv:2505.12335 (2025).

[16] Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. 2020. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8060–8069.

[17] Scott McCloskey and Michael Albright. 2019. Detecting GAN-generated imagery using saturation cues. In 2019 IEEE international conference on image processing (ICIP). IEEE, 4584–4588.

[18] Augustus Odena, Vincent Dumoulin, and Chris Olah. 2016. Deconvolution and Checkerboard Artifacts. Distill (2016). doi:10.23915/distill.00003

[19] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. 2023. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24480–24489.

[20] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023).

[21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

[22] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv preprint arXiv:2112.10752 (2021).

[23] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. 2025. Dinov3. arXiv preprint arXiv:2508.10104 (2025).

[24] Richard Sutton. 2019. The bitter lesson. Incomplete Ideas (blog) 13, 1 (2019), 38.

[25] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 5052–5060.

[26] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 28130–28139.

[27] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).

[28] Michael Tschannen, Alexey Gritsenko, Xiao Wang, et al. 2025. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025).

[29] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8695–8704.

[30] Zhendong Wang, Jianmin Bao, Wengang Zhou, et al. 2023. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22445–22455.

[31] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. 2023. Demystifying clip data. arXiv preprint arXiv:2309.16671 (2023).

[32] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. 2024. A sanity check for ai-generated image detection. arXiv preprint arXiv:2406.19435 (2024).

[33] Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. 2024. Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection. arXiv preprint arXiv:2411.15633 (2024).

[34] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision. 11975–11986.

[35] Yue Zhou, Xinan He, KaiQing Lin, Bin Fan, Feng Ding, and Bin Li. 2025. Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection. arXiv preprint arXiv:2506.00874 (2025).

[36] Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji, Ke Yan, Shouhong Ding, Xiaoshuai Sun, Yunsheng Wu, and Rongrong Ji. 2025. AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models. arXiv preprint arXiv:2507.02664 (2025).

[37] Mingjian Zhu, Hanting Chen, Qiangyu Yan, et al. 2023. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems 36 (2023), 77771–77782.
