Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
Pith reviewed 2026-05-16 22:03 UTC · model grok-4.3
The pith
Learning a compact set of canonical forgery prototypes overcomes the Benefit then Conflict dilemma and sustains high detection accuracy as generator diversity grows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAPL learns a compact set of canonical forgery prototypes that define a unified, low-variance feature space and counter data heterogeneity. A two-stage training scheme with Low-Rank Adaptation then increases discriminative power while preserving pretrained knowledge, yielding a more robust decision boundary that generalizes to unseen generators.
What carries the argument
Generator-Aware Prototype Learning that constrains representations around learned canonical forgery prototypes and applies two-stage Low-Rank Adaptation training.
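One way to picture the prototype constraint is as a re-expression of each encoder feature by its similarity to a small set of prototype vectors. The sketch below is an illustration only: the cosine similarity, the function names, and the toy 3-d vectors are assumptions made here, and the paper's actual formulation is not given on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prototype_scores(feature, prototypes):
    """Describe one encoder feature by its similarity to each prototype."""
    return [cosine(feature, p) for p in prototypes]


# Hypothetical values: K = 2 prototypes in a 3-d feature space.
# In a real detector the prototypes would be learned, not hand-written.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature = [3.0, 4.0, 0.0]

scores = prototype_scores(feature, prototypes)
print(scores)
```

Under this reading, every image is described by K numbers regardless of which generator produced it, which is one plausible route to the lower-variance space the claim refers to.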
If this is right
- Detection accuracy remains stable rather than declining as the number of training generators increases.
- Feature overlap between real and synthetic images decreases enough to support a single shared decision boundary.
- Pretrained encoders can be efficiently adapted to new forgery types without full retraining or loss of prior knowledge.
- The same framework works for both GAN and diffusion generators without separate tuning steps.
Where Pith is reading between the lines
- The same prototype constraint could be applied to other heterogeneous detection problems such as deepfake video or audio where source variety also causes feature drift.
- If the prototypes capture generator-independent forgery signals, future detectors might require far fewer labeled examples from each new generator.
- Explicit modeling of generator differences through prototypes offers an alternative to simply scaling dataset size for better generalization.
Load-bearing premise
That a compact set of learned canonical forgery prototypes can sufficiently reduce data-level heterogeneity and create a unified low-variance feature space that generalizes to unseen generators without post-hoc tuning.
What would settle it
A measurable drop in accuracy when the training set is expanded with additional generators beyond those used in the reported experiments or when the detector is tested on a generator whose forgery patterns are absent from the learned prototypes.
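The test described above can be phrased as a simple check on an accuracy curve indexed by the number of training generators. The following is a minimal sketch under stated assumptions: the function name and the tolerance are hypothetical, the accuracy values are placeholders invented for illustration (not results from the paper), and the generator counts 1, 2, 4, 8 are taken from the group table in the supplementary fragment.

```python
def benefit_then_conflict(accuracies, tolerance=0.0):
    """Return (peak_index, degraded) for accuracies ordered by generator count.

    degraded is True when the last accuracy is below the peak by more than
    `tolerance`, i.e. adding generators eventually hurt.
    """
    peak = max(range(len(accuracies)), key=lambda i: accuracies[i])
    degraded = (accuracies[peak] - accuracies[-1]) > tolerance
    return peak, degraded


generator_counts = [1, 2, 4, 8]
placeholder_acc = [0.80, 0.86, 0.88, 0.85]  # placeholder values, not measured

peak, degraded = benefit_then_conflict(placeholder_acc, tolerance=0.01)
print(f"peak at {generator_counts[peak]} generators, degraded afterwards: {degraded}")
```

A detector that escapes the dilemma would report `degraded` as False as the list of generator counts is extended.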
Original abstract
The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrains representations with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity. To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at https://github.com/UltraCapture/GAPL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a 'Benefit then Conflict' dilemma in scaling AI-generated image detectors, where performance stagnates or degrades with added generator diversity due to data-level heterogeneity (overlapping real/synthetic features) and model-level bottlenecks from fixed pretrained encoders. It proposes Generator-Aware Prototype Learning (GAPL), which learns a compact set of canonical forgery prototypes to enforce a unified low-variance feature space and employs a two-stage LoRA-based adaptation scheme to enhance discriminability while retaining pretrained knowledge, claiming SOTA detection accuracy across GAN and diffusion generators with released code.
Significance. If the generalization claims hold under rigorous controls, GAPL offers a practical framework for handling increasing generator heterogeneity without full retraining, potentially advancing universal AIGI detectors. The two-stage adaptation and prototype constraint are conceptually sound for preserving knowledge while reducing variance; code release aids reproducibility, though the empirical nature (no parameter-free derivations) limits theoretical impact.
major comments (3)
- [Experiments] Experiments section: The SOTA claims lack any description of data splits (e.g., how training generators are partitioned from test generators), statistical significance tests, or explicit controls for generator overlap; this directly undermines verification of the central generalization claim that prototypes transfer to unseen models without post-hoc tuning.
- [Method] Method (prototype learning): The core premise that a small learned prototype set creates a 'unified, low-variance feature space' generalizing beyond training generators is not supported by ablation on prototype count or analysis showing invariance vs. training-specific artifacts (e.g., frequency patterns); joint optimization with the classifier risks encoding dataset statistics rather than canonical forgeries.
- [Results] Results: No tables or figures report variance across multiple runs or controls for LoRA rank sensitivity, making it impossible to assess whether reported accuracy gains are robust or depend on unstated hyperparameter choices.
minor comments (2)
- [Abstract] Abstract: The 'Benefit then Conflict dilemma' is presented as a novel diagnosis but would benefit from a brief citation or explicit contrast to prior heterogeneity analyses in AIGI literature.
- [Method] Notation: Prototype and LoRA parameters are introduced without a clear summary table of free parameters (number of prototypes, rank), which would aid clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested details on experimental protocols, prototype validation, and result robustness.
- Referee: [Experiments] Experiments section: The SOTA claims lack any description of data splits (e.g., how training generators are partitioned from test generators), statistical significance tests, or explicit controls for generator overlap; this directly undermines verification of the central generalization claim that prototypes transfer to unseen models without post-hoc tuning.
  Authors: We agree that explicit documentation of the data partitioning is required to verify the generalization claims. In the revised manuscript we will add a dedicated subsection describing the generator splits, confirming that training and test generators are fully disjoint with no overlap. We will also report statistical significance via paired t-tests across runs and include explicit controls (e.g., a table listing training versus held-out generators) to demonstrate that prototypes are evaluated on completely unseen models.
  Revision: yes
- Referee: [Method] Method (prototype learning): The core premise that a small learned prototype set creates a 'unified, low-variance feature space' generalizing beyond training generators is not supported by ablation on prototype count or analysis showing invariance vs. training-specific artifacts (e.g., frequency patterns); joint optimization with the classifier risks encoding dataset statistics rather than canonical forgeries.
  Authors: We acknowledge the need for stronger empirical support of the prototype mechanism. We will add an ablation study varying prototype count (e.g., 1–16) and report both accuracy and feature-space variance metrics. To address invariance, we will include comparative analyses of frequency spectra and intra-class variance on training versus unseen generators, showing that the learned prototypes reduce overlap without encoding generator-specific artifacts. We will further clarify that the two-stage LoRA procedure first adapts the encoder while freezing the prototype layer, thereby limiting the risk of merely memorizing dataset statistics.
  Revision: yes
- Referee: [Results] Results: No tables or figures report variance across multiple runs or controls for LoRA rank sensitivity, making it impossible to assess whether reported accuracy gains are robust or depend on unstated hyperparameter choices.
  Authors: We agree that variance reporting and hyperparameter sensitivity are essential. The revised results section will include mean and standard deviation computed over five independent runs with different random seeds for all main tables. We will also add a sensitivity table for LoRA rank (ranks 4, 8, 16, 32), demonstrating that performance remains stable and that the reported gains are not artifacts of a single hyperparameter setting.
  Revision: yes
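The controls promised in these responses (disjoint generator splits, mean and standard deviation over five seeds, and a paired t-test) are simple to state in code. The sketch below uses only the Python standard library. The split shown is hypothetical (the generator names appear on this page, but the paper's actual partition is not described here), and the per-seed accuracies are placeholders, not reported results.

```python
import math
import statistics


def are_disjoint(train_generators, test_generators):
    """True when no generator appears in both the training and held-out sets."""
    return set(train_generators).isdisjoint(test_generators)


def mean_std(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Hypothetical split, for illustration only.
train = ["SDv1.4", "BigGAN"]
held_out = ["VQDM", "Glide"]
print("disjoint split:", are_disjoint(train, held_out))

# Placeholder accuracies (percent) for five seeds; not measured values.
ours = [91.0, 92.0, 93.0, 94.0, 95.0]
baseline = [90.0, 90.0, 90.0, 90.0, 90.0]

m, s = mean_std(ours)
t = paired_t_statistic(ours, baseline)
print(f"ours: {m:.1f} +/- {s:.2f}, paired t = {t:.3f} (df = {len(ours) - 1})")
```

Turning the t statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.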
Circularity Check
No significant circularity; empirical framework validated by external experiments
The paper presents GAPL as a practical two-stage training method that learns forgery prototypes and applies LoRA adaptation to a pretrained encoder. No equations, derivations, or self-citation chains are provided in the manuscript that reduce any claimed prediction or unified feature space back to fitted training quantities by construction. The central claims rest on reported experimental accuracy across held-out GAN and diffusion generators rather than on any self-definitional loop or imported uniqueness theorem. This is the standard case of an empirical detector whose generalization is tested externally and therefore receives a zero circularity score.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of prototypes
- LoRA rank and adaptation parameters
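To make the second free parameter concrete: in standard LoRA, a frozen weight of shape d_out x d_in is augmented by a product B A, with B of shape d_out x r and A of shape r x d_in, so each adapted matrix adds r(d_in + d_out) trainable parameters. The short sketch below tabulates this for the ranks named in the rebuttal. The layer width of 1024 is an assumption chosen for illustration; the encoder dimensions used in the paper are not stated on this page.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    B has d_out * rank entries and A has rank * d_in entries.
    """
    return rank * (d_in + d_out)


d = 1024  # hypothetical layer width, not taken from the paper
full = d * d  # parameters in the frozen square weight

for rank in (4, 8, 16, 32):
    added = lora_param_count(d, d, rank)
    print(f"rank {rank:>2}: {added:>6} added parameters ({added / full:.2%} of the full matrix)")
```

The count grows linearly in the rank, which is why a sensitivity sweep over a handful of ranks is cheap to run.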
axioms (2)
- domain assumption: Fixed pretrained encoders cannot adapt to rising complexity of diverse generators without modification.
- domain assumption: A structured prototype-based constraint can create a unified low-variance feature space.
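The second assumption is testable if "low-variance" is given an operational meaning. A minimal sketch, assuming intra-class variance is measured as the mean squared distance to the class centroid and separation as a ratio of between-class to within-class spread, in the spirit of Fisher's discriminant criterion. The function names and the toy 2-d points are hypothetical, and the paper may define its metrics differently.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def intra_class_variance(points):
    """Mean squared distance of the points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points) / len(points)


def separation_ratio(real, fake):
    """Between-centroid squared distance divided by summed within-class variance."""
    between = sq_dist(centroid(real), centroid(fake))
    within = intra_class_variance(real) + intra_class_variance(fake)
    return between / within


# Toy 2-d features, for illustration only.
real = [[0.0, 0.0], [2.0, 0.0]]
fake = [[10.0, 0.0], [12.0, 0.0]]

print(intra_class_variance(real), intra_class_variance(fake))
print(separation_ratio(real, fake))
```

If the assumption holds, the ratio computed on features from held-out generators should stay high as training generators are added, instead of shrinking as the two classes overlap.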
Forward citations
Cited by 2 Pith papers
- HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild
  HEDGE is a heterogeneous ensemble using progressive DINOv3 training, multi-scale features, and MetaCLIP2 diversity with dual-gating fusion to achieve robust AI-generated image detection and 4th place in the NTIRE 2026...
- Robust Deepfake Detection, NTIRE 2026 Challenge: Report
  The NTIRE 2026 challenge finds that large foundation models combined with ensembles and degradation-aware training produce the most robust deepfake detectors.
Reference graph
Works this paper leans on
- [1] Midjourney. In https://www.midjourney.com/home/, 2022.
- [2] Wukong, 2022. 5. In https://xihe.mindspore.cn/modelzoo/wukong, 2022. 5.
- [3] Adobe. Adobe firefly. https://firefly.adobe.com/, 2025. Accessed: 2025-11-04.
- [4] Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Reverse engineering of generative models: Inferring model hyperparameters from generated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15477–15493, 2023.
- [5] Songran Bai, Yuheng Ji, Yue Liu, Xingwei Zhang, Xiaolong Zheng, and Daniel Dajun Zeng. Alleviating performance disparity in adversarial spatiotemporal graph learning under zero-inflated distribution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11436–11444, 2025.
- [6] Shuanghao Bai, Wenxuan Song, Jiayi Chen, Yuheng Ji, Zhide Zhong, Jin Yang, Han Zhao, Wanqi Zhou, Wei Zhao, Zhe Li, et al. Towards a unified understanding of robot manipulation: A comprehensive survey. arXiv preprint arXiv:2510.10903, 2025.
- [7] Quentin Bammey. Synthbuster: Towards detection of diffusion model generated images. IEEE Open Journal of Signal Processing, 5:1–9, 2024.
- [8] Black Forest Labs. FLUX.1: Speeding up text-to-image generation. https://blackforestlabs.ai, 2025. Accessed: 2025-11-26.
- [9] Andrew Brock et al. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
- [10] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding properties that generalize. In European Conference on Computer Vision, 2020.
- [11] Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In Forty-first International Conference on Machine Learning, 2024.
- [12] Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, and Shouhong Ding. Dual data alignment makes ai-generated image detector easier generalizable. In Advances in Neural Information Processing Systems, 2025.
- [13] Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13455–13465, 2025.
- [14] Yunjey Choi et al. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
- [15] Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Proceedings of the 41st International Conference on Machine Learning, pages 9550–9575. PMLR, 2024.
- [16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- [17] Prafulla Dhariwal et al. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
- [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
- [19] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
- [20] Google, Inc. Nano banana. https://www.nano-banana.com/, 2025. Accessed: 2025-08-29.
- [21] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
- [22] Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 18685–18694, 2025.
- [23] Xiao Guo, Vishal Asnani, Sijia Liu, and Xiaoming Liu. Tracing hyperparameter dependencies for model parsing via learnable graph pooling network. In Advances in Neural Information Processing Systems, pages 116899–116932. Curran Associates, Inc., 2024.
- [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [25] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [26] Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. Bihpf: Bilateral high-pass filters for robust deepfake detection. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2878–2887, 2022.
- [27] Yuheng Ji, Yue Liu, Zhicheng Zhang, Zhao Zhang, Yuting Zhao, Xiaoshuai Hao, Gang Zhou, Xingwei Zhang, and Xiaolong Zheng. Enhancing adversarial robustness of vision-language models through low-rank adaptation. In Proceedings of the 2025 International Conference on Multimedia Retrieval, pages 550–559, 2025.
- [28] Yuheng Ji, Huajie Tan, Cheng Chi, Yijie Xu, Yuting Zhao, Enshen Zhou, Huaihai Lyu, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, et al. Mathsticks: A benchmark for visual symbolic compositional reasoning with matchstick puzzles. arXiv preprint arXiv:2510.00483, 2025.
- [29] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025.
- [30] Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, and Xiaolong Zheng. Visualtrans: A benchmark for real-world visual transformation reasoning. arXiv preprint arXiv:2508.04043, 2025.
- [31] Tero Karras et al. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
- [32] Tero Karras et al. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
- [33] Kvikontent. Kvikontent-midjourney v6. https://huggingface.co/Kvikontent/midjourney-v6, 2023.
- [34] Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 2405–2414, 2025.
- [35] Shuqiao Liang, Jian Liu, Renzhang Chen, and Quanlong Guan. Ferretnet: Efficient synthetic image detection via local pixel dependencies, 2025.
- [36] Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10770–10780, 2024.
- [37] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- [38] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. Submitted November 14, 2017; revised January 4, 2019.
- [40] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023.
- [41] Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. Lare^2: Latent reconstruction error based method for diffusion-generated image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17006–17015, 2024.
- [42] Huaihai Lyu, Chaofan Chen, Yuheng Ji, and Changsheng Xu. Egoprompt: Prompt pool learning for egocentric action recognition. arXiv preprint arXiv:2508.03266, 2025.
- [43] Midjourney, Inc. Midjourney v6. https://www.midjourney.com, 2025. AI model version 6.0, Accessed: 2025-11-26.
- [44] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- [45] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023.
- [46] Jeongsoo Park and Andrew Owens. Community forensics: Using thousands of generators to train fake image detectors. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 8245–8257, 2025.
- [47] Taesung Park et al. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
- [48] Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
- [49] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [50] Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. In The Thirteenth International Conference on Learning Representations, 2025.
- [51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
- [52] Andreas Rossler et al. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11,
- [53] Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey Kuznetsov, and Denis Dimitrov. Kandinsky 2.2. https://github.com/ai-forever/Kandinsky-2, 2023.
- [54] Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, et al. Maniplvm-r1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models. arXiv preprint arXiv:2505.16517, 2025.
- [55] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024.
- [56] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024.
- [57] Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7184–7192, 2025.
- [58] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [59] Huajie Tan, Cheng Chi, Xiansheng Chen, Yuheng Ji, Zhongxia Zhao, Xiaoshuai Hao, Yaoxu Lyu, Mingyu Cao, Junkai Zhao, Huaihai Lyu, et al. Roboos-next: A unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration. arXiv preprint arXiv:2510.26536,
- [60] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16515–16525, 2022.
- [61] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Galip: Generative adversarial clips for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14214–14223,
- [62] BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025.
- [63]
- [64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc.,
- [65] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020.
- [66] Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, and Xiaolong Zheng. Towards cross-view point correspondence in vision-language models. arXiv preprint arXiv:2512.04686, 2025.
- [67] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023.
- [68] Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025.
- [69] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...
- [70] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. arXiv preprint arXiv:2406.19435, 2024.
- [71] Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. arXiv preprint arXiv:2411.15633, 2024.
- [72] Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D3: Scaling up deepfake detection by learning from discrepancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [73] Haifeng Zhang, Qinghui He, Xiuli Bi, Weisheng Li, Bo Liu, and Bin Xiao. Towards universal ai-generated image detection by variational information bottleneck network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23828–23837, 2025.
- [74] Jun-Yan Zhu et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
- [75] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36:77771–77782, 2023.
- [76] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36, 2024.
Supplementary material (extracted fragments)
Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes: Supplementary Material
We...
... for collecting, which is the same as our training dataset. We randomly sample 2 images from each generator in it, which yields about 9000 generated images. Then we randomly sample 8,000 real images as before to construct the last dataset in the series.
group  ng  Generator(s)
1      1   SDv1.4
2      2   SDv1.4, BigGAN
3      4   SDv1.4, BigGAN, VQDM, Glide
4      8   All GenImage...
... diffusion model. GenImage [76] provides a dataset trained on ImageNet-1k. It covers 8 generative models spanning GANs, diffusion models, and commercial APIs, including BigGAN [9], VQDM [21], Stable Diffusion models, Wukong [2], ADM [17], and Midjourney [1]. SynthBuster [7] provides an aligned dataset, where real images and generated images are all in PNG format, wh...
... rethink the up-sampling operation in most generative architectures and detect generated images via an interpolation pattern. UniFD [45] leverages the image encoder of CLIP for feature extraction and classifies the image embeddings with a simple KNN or linear layer. SAFE [34] extracts high-frequency bands as artifacts, with various data augmentations, to build a CNN clas...