arxiv: 2512.12982 · v2 · submitted 2025-12-15 · 💻 cs.CV

Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes

Pith reviewed 2026-05-16 22:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detection · prototype learning · generalization · GAN · diffusion models · Low-Rank Adaptation · forgery detection

The pith

Learning a compact set of canonical forgery prototypes overcomes the Benefit then Conflict dilemma and sustains high detection accuracy as generator diversity grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that adding more generators to training data for AI-generated image detectors first boosts then harms performance because feature distributions of real and fake images overlap more and fixed pretrained encoders cannot keep up. It introduces Generator-Aware Prototype Learning to learn a small number of canonical forgery prototypes that pull synthetic images into a unified low-variance space. A two-stage training process with Low-Rank Adaptation then adapts the encoder while retaining useful pretrained features. Experiments across many GAN and diffusion generators show the resulting detector maintains superior accuracy without needing per-generator tuning.

Core claim

GAPL learns a compact set of canonical forgery prototypes to create a unified low-variance feature space that counters data heterogeneity and employs a two-stage training scheme with Low-Rank Adaptation to enhance discriminative power while preserving pretrained knowledge, establishing a more robust decision boundary that generalizes to unseen generators.

What carries the argument

Generator-Aware Prototype Learning that constrains representations around learned canonical forgery prototypes and applies two-stage Low-Rank Adaptation training.
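
The Figure 4 caption reproduced further down describes prototypes as the top N PCA components of embeddings, with image embeddings later mapped onto those prototypes. A minimal NumPy sketch of that idea is given here, using random vectors in place of encoder outputs. The function names `extract_prototypes` and `prototype_scores`, the softmax scoring rule, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_prototypes(embeddings: np.ndarray, n_prototypes: int) -> np.ndarray:
    """Return the top-n principal directions of L2-normalized embeddings."""
    # Normalize each embedding to unit length, then center the set.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centered = normed - normed.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_prototypes]

def prototype_scores(embedding: np.ndarray, prototypes: np.ndarray,
                     temperature: float = 0.1) -> np.ndarray:
    """Softmax over embedding-prototype similarities (assumed scoring rule)."""
    unit = embedding / np.linalg.norm(embedding)
    sims = prototypes @ unit / temperature
    sims -= sims.max()  # subtract the maximum for numerical stability
    weights = np.exp(sims)
    return weights / weights.sum()

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 32))  # stand-in for encoder outputs
prototypes = extract_prototypes(fake_embeddings, n_prototypes=4)
scores = prototype_scores(fake_embeddings[0], prototypes)
```

The prototypes returned this way are orthonormal by construction, which is one reason a small N can still span distinct directions of variation; whether GAPL relies on that property is not stated in the material above.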

If this is right

  • Detection accuracy remains stable rather than declining as the number of training generators increases.
  • Feature overlap between real and synthetic images decreases enough to support a single shared decision boundary.
  • Pretrained encoders can be efficiently adapted to new forgery types without full retraining or loss of prior knowledge.
  • The same framework works for both GAN and diffusion generators without separate tuning steps.
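
Low-Rank Adaptation, as commonly formulated, keeps a pretrained weight matrix frozen and learns a low-rank additive update. The sketch below shows that update for a single linear layer in NumPy. The class name `LoRALinear`, the layer sizes, the rank, and the scaling factor are arbitrary illustrative choices; nothing here reflects which layers or ranks GAPL actually adapts.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a low-rank update (alpha / r) * B A added to W."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                 # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Frozen path plus low-rank path; with B = 0 the layer matches the original.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    def trainable_params(self) -> int:
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 128))        # stand-in for one pretrained projection
layer = LoRALinear(W, rank=8, alpha=16)
x = rng.normal(size=(4, 128))
out = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained output exactly, which is the sense in which pretrained knowledge is preserved at the start of fine-tuning. The trainable parameter count grows as rank × (d_in + d_out) rather than d_in × d_out.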

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prototype constraint could be applied to other heterogeneous detection problems such as deepfake video or audio where source variety also causes feature drift.
  • If the prototypes capture generator-independent forgery signals, future detectors might require far fewer labeled examples from each new generator.
  • Explicit modeling of generator differences through prototypes offers an alternative to simply scaling dataset size for better generalization.

Load-bearing premise

That a compact set of learned canonical forgery prototypes can sufficiently reduce data-level heterogeneity and create a unified low-variance feature space that generalizes to unseen generators without post-hoc tuning.
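
One way to probe this premise is to compare the separation between real and synthetic features against their within-class spread. The sketch below computes a Fisher-style ratio on synthetic Gaussian data. The function name `fisher_ratio` and the data are assumptions made for illustration, and the paper's own variance metrics may differ.

```python
import numpy as np

def fisher_ratio(real: np.ndarray, fake: np.ndarray) -> float:
    """Squared distance between class means divided by total within-class variance."""
    mu_real, mu_fake = real.mean(axis=0), fake.mean(axis=0)
    between = np.sum((mu_real - mu_fake) ** 2)
    within = real.var(axis=0).sum() + fake.var(axis=0).sum()
    return float(between / within)

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=(500, 16))
# Synthetic stand-ins: a low-variance and a high-variance "fake" class.
tight_fake = rng.normal(loc=1.0, scale=0.5, size=(500, 16))
spread_fake = rng.normal(loc=1.0, scale=3.0, size=(500, 16))

ratio_tight = fisher_ratio(real, tight_fake)
ratio_spread = fisher_ratio(real, spread_fake)
```

With the class means held fixed, the high-variance class yields the lower ratio, which mirrors the claim that heterogeneity increases overlap between real and synthetic features.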

What would settle it

A measurable drop in accuracy when the training set is expanded with additional generators beyond those used in the reported experiments or when the detector is tested on a generator whose forgery patterns are absent from the learned prototypes.
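
Such a test could be scripted by sweeping the number of training generators and checking whether accuracy peaks and then falls. The sketch below does that check; the helper name `find_conflict_onset`, the tolerance, and every number in the example are hypothetical placeholders, not results from the paper.

```python
from typing import Optional, Sequence

def find_conflict_onset(accuracies: Sequence[float], tolerance: float = 0.0) -> Optional[int]:
    """Return the index of peak accuracy if a later value drops below it by more
    than `tolerance`; return None if accuracy never degrades after the peak."""
    peak_idx = max(range(len(accuracies)), key=lambda i: accuracies[i])
    later = accuracies[peak_idx + 1:]
    if later and min(later) < accuracies[peak_idx] - tolerance:
        return peak_idx
    return None

# Placeholder sweep: made-up accuracies for increasing generator counts.
generator_counts = [1, 4, 16, 64, 256]
hypothetical_acc = [0.80, 0.88, 0.91, 0.89, 0.86]

onset = find_conflict_onset(hypothetical_acc, tolerance=0.01)
if onset is not None:
    print(f"accuracy peaks at {generator_counts[onset]} generators, then declines")
```

A detector that avoids the dilemma would return `None` on a sweep like this, since accuracy would not fall after its peak.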

Figures

Figures reproduced from arXiv: 2512.12982 by Renshuai Tao, Xiaolong Zheng, Yipu Wang, Yuheng Ji, Yuxuan Tian, Yuyang Liu, Ziheng Qin.

Figure 1: Illustration of the “benefit then conflict” phenomenon we identify. As more generators are added to train the detector, it first benefit…
Figure 2: T-SNE visualization on CLIP Embeddings. (a) Comparison between images generated by a single generator and real images. (b) Comparison between images generated by thousands of diverse generators and real images. Moreover, in-depth experiments demonstrate the excellent robustness of our detector against post processing, thoroughly highlighting its applicability in real world. Our contributions are summariz…
Figure 3: Results of the experiment on a series of datasets com…
Figure 4: The overall framework of our GAPL. We train GAPL in two stages. In the first stage, we train an MLP and extract embeddings for PCA decomposition, retaining only the top N components as prototypes; in the second stage, we use LoRA to fine-tune the encoder and map image embeddings to prototypes to get the final logits. … process can be formulated as follows: $f = \phi(x), \quad \hat{y} = \mathrm{MLP}(\mathrm{Normalize}(f))$ (7), where $\phi(\cdot)$ is th…
Figure 5: Ablation study regarding the strategy of extracting proto…
Figure 7: Average attention scores on the validation set. Columns in red and green denote prototypes extracted from real images and generated images, respectively. We collected images that exhibit high attention scores for a specific prototype and discovered that they have visual features in common. …totypes, we clustered the images that exhibited particularly high attention scores for a specific prototype. We find that…
Figure 8: Examples of test subsets. We visualize some in-domain datasets with our training set along with some out-of-distribution sets.
Figure 9: Self-attention maps of the original CLIP backbone and our fine-tuned backbone. There are 24 ViT blocks in the image encoder; we plot 8 blocks in each row, with indices increasing from left to right. For clarity of visualization, we use bicubic interpolation between image patches. In the shallow layers, we preserve most semantic features; in deep layers, our attention includes a wider range compared to origi…
Figure 10: Self-attention maps of the original CLIP backbone and our fine-tuned backbone. There are 24 ViT blocks in the image encoder; we plot 8 blocks in each row, with indices increasing from left to right. For clarity of visualization, we use bicubic interpolation between image patches. In the shallow layers, we preserve most semantic features; in deep layers, our attention includes a wider range compared to orig…

Original abstract

The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrains representations with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity. To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at https://github.com/UltraCapture/GAPL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies a 'Benefit then Conflict' dilemma in scaling AI-generated image detectors, where performance stagnates or degrades with added generator diversity due to data-level heterogeneity (overlapping real/synthetic features) and model-level bottlenecks from fixed pretrained encoders. It proposes Generator-Aware Prototype Learning (GAPL), which learns a compact set of canonical forgery prototypes to enforce a unified low-variance feature space and employs a two-stage LoRA-based adaptation scheme to enhance discriminability while retaining pretrained knowledge, claiming SOTA detection accuracy across GAN and diffusion generators with released code.

Significance. If the generalization claims hold under rigorous controls, GAPL offers a practical framework for handling increasing generator heterogeneity without full retraining, potentially advancing universal AIGI detectors. The two-stage adaptation and prototype constraint are conceptually sound for preserving knowledge while reducing variance; code release aids reproducibility, though the empirical nature (no parameter-free derivations) limits theoretical impact.

major comments (3)
  1. [Experiments] Experiments section: The SOTA claims lack any description of data splits (e.g., how training generators are partitioned from test generators), statistical significance tests, or explicit controls for generator overlap; this directly undermines verification of the central generalization claim that prototypes transfer to unseen models without post-hoc tuning.
  2. [Method] Method (prototype learning): The core premise that a small learned prototype set creates a 'unified, low-variance feature space' generalizing beyond training generators is not supported by ablation on prototype count or analysis showing invariance vs. training-specific artifacts (e.g., frequency patterns); joint optimization with the classifier risks encoding dataset statistics rather than canonical forgeries.
  3. [Results] Results: No tables or figures report variance across multiple runs or controls for LoRA rank sensitivity, making it impossible to assess whether reported accuracy gains are robust or depend on unstated hyperparameter choices.
minor comments (2)
  1. [Abstract] Abstract: The 'Benefit then Conflict dilemma' is presented as a novel diagnosis but would benefit from a brief citation or explicit contrast to prior heterogeneity analyses in AIGI literature.
  2. [Method] Notation: Prototype and LoRA parameters are introduced without a clear summary table of free parameters (number of prototypes, rank), which would aid clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested details on experimental protocols, prototype validation, and result robustness.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The SOTA claims lack any description of data splits (e.g., how training generators are partitioned from test generators), statistical significance tests, or explicit controls for generator overlap; this directly undermines verification of the central generalization claim that prototypes transfer to unseen models without post-hoc tuning.

    Authors: We agree that explicit documentation of the data partitioning is required to verify the generalization claims. In the revised manuscript we will add a dedicated subsection describing the generator splits, confirming that training and test generators are fully disjoint with no overlap. We will also report statistical significance via paired t-tests across runs and include explicit controls (e.g., a table listing training versus held-out generators) to demonstrate that prototypes are evaluated on completely unseen models. revision: yes

  2. Referee: [Method] Method (prototype learning): The core premise that a small learned prototype set creates a 'unified, low-variance feature space' generalizing beyond training generators is not supported by ablation on prototype count or analysis showing invariance vs. training-specific artifacts (e.g., frequency patterns); joint optimization with the classifier risks encoding dataset statistics rather than canonical forgeries.

    Authors: We acknowledge the need for stronger empirical support of the prototype mechanism. We will add an ablation study varying prototype count (e.g., 1–16) and report both accuracy and feature-space variance metrics. To address invariance, we will include comparative analyses of frequency spectra and intra-class variance on training versus unseen generators, showing that the learned prototypes reduce overlap without encoding generator-specific artifacts. We will further clarify that the two-stage LoRA procedure first adapts the encoder while freezing the prototype layer, thereby limiting the risk of merely memorizing dataset statistics. revision: yes

  3. Referee: [Results] Results: No tables or figures report variance across multiple runs or controls for LoRA rank sensitivity, making it impossible to assess whether reported accuracy gains are robust or depend on unstated hyperparameter choices.

    Authors: We agree that variance reporting and hyperparameter sensitivity are essential. The revised results section will include mean and standard deviation computed over five independent runs with different random seeds for all main tables. We will also add a sensitivity table for LoRA rank (ranks 4, 8, 16, 32), demonstrating that performance remains stable and that the reported gains are not artifacts of a single hyperparameter setting. revision: yes
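
The checks promised in these responses (disjoint generator splits, mean and standard deviation over seeds, and a paired t-test) can be sketched with the Python standard library. All generator names and accuracy values below are placeholders invented for illustration, and the helper names are hypothetical.

```python
import math
import statistics

def assert_disjoint(train_generators, test_generators):
    """Raise if any generator appears in both the training and the test split."""
    overlap = set(train_generators) & set(test_generators)
    if overlap:
        raise ValueError(f"generators in both splits: {sorted(overlap)}")

def mean_std(values):
    """Mean and sample standard deviation across runs."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples, with len(a) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Placeholder split and per-seed accuracies (five seeds), not taken from the paper.
assert_disjoint(["gen_a", "gen_b"], ["gen_c", "gen_d"])
method_acc = [0.91, 0.92, 0.90, 0.93, 0.91]
baseline_acc = [0.88, 0.90, 0.87, 0.90, 0.89]

mean_acc, std_acc = mean_std(method_acc)
t_stat = paired_t_statistic(method_acc, baseline_acc)
```

Turning the statistic into a p-value requires the t distribution; a library routine such as `scipy.stats.ttest_rel` handles that step when SciPy is available.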

Circularity Check

0 steps flagged

No significant circularity; empirical framework validated by external experiments

Full rationale

The paper presents GAPL as a practical two-stage training method that learns forgery prototypes and applies LoRA adaptation to a pretrained encoder. No equations, derivations, or self-citation chains are provided in the manuscript that reduce any claimed prediction or unified feature space back to fitted training quantities by construction. The central claims rest on reported experimental accuracy across held-out GAN and diffusion generators rather than on any self-definitional loop or imported uniqueness theorem. This is the standard case of an empirical detector whose generalization is tested externally and therefore receives a zero circularity score.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method rests on the assumption that pretrained encoders hold transferable knowledge worth preserving and that a small number of prototypes can capture forgery patterns across heterogeneous generators.

free parameters (2)
  • number of prototypes
    Compact set size chosen to balance coverage of forgery patterns against variance reduction; value not specified in abstract.
  • LoRA rank and adaptation parameters
    Hyperparameters controlling the capacity of the two-stage adaptation; selected to enhance discriminative power while retaining pretrained features.
axioms (2)
  • domain assumption Fixed pretrained encoders cannot adapt to rising complexity of diverse generators without modification.
    Invoked to justify the model-level bottleneck diagnosis and the need for LoRA.
  • domain assumption A structured prototype-based constraint can create a unified low-variance feature space.
    Central to the data-heterogeneity solution in the proposed framework.

pith-pipeline@v0.9.0 · 5538 in / 1324 out tokens · 30660 ms · 2026-05-16T22:03:26.285399+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild

    cs.CV 2026-04 unverdicted novelty 4.0

    HEDGE is a heterogeneous ensemble using progressive DINOv3 training, multi-scale features, and MetaCLIP2 diversity with dual-gating fusion to achieve robust AI-generated image detection and 4th place in the NTIRE 2026...

  2. Robust Deepfake Detection, NTIRE 2026 Challenge: Report

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge finds that large foundation models combined with ensembles and degradation-aware training produce the most robust deepfake detectors.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 2 Pith papers · 3 internal anchors

  1. [1]

    Midjourney.Inhttps://www.midjourney.com/ home/, 2022. 4, 6, 1

  2. [2]

    5.Inhttps://xihe.mindspore.cn/ modelzoo/wukong, 2022

    Wukong, 2022. 5.Inhttps://xihe.mindspore.cn/ modelzoo/wukong, 2022. 5. 1

  3. [3]

    Adobe firefly.https://firefly.adobe

    Adobe. Adobe firefly.https://firefly.adobe. com/, 2025. Accessed: 2025-11-04. 1

  4. [4]

    Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Re- verse engineering of generative models: Inferring model hy- perparameters from generated images.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15477– 15493, 2023. 3

  5. [5]

    Alleviating performance disparity in adversarial spatiotemporal graph learning under zero-inflated distribution

    Songran Bai, Yuheng Ji, Yue Liu, Xingwei Zhang, Xiaolong Zheng, and Daniel Dajun Zeng. Alleviating performance disparity in adversarial spatiotemporal graph learning under zero-inflated distribution. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 11436–11444, 2025. 4

  6. [6]

    Towards a unified understanding of robot manipulation: A comprehensive survey.arXiv preprint arXiv:2510.10903, 2025

    Shuanghao Bai, Wenxuan Song, Jiayi Chen, Yuheng Ji, Zhide Zhong, Jin Yang, Han Zhao, Wanqi Zhou, Wei Zhao, Zhe Li, et al. Towards a unified understanding of robot manipulation: A comprehensive survey.arXiv preprint arXiv:2510.10903, 2025. 4

  7. [7]

    Synthbuster: Towards detection of diffu- sion model generated images.IEEE Open Journal of Signal Processing, 5:1–9, 2024

    Quentin Bammey. Synthbuster: Towards detection of diffu- sion model generated images.IEEE Open Journal of Signal Processing, 5:1–9, 2024. 6, 1

  8. [8]

    FLUX.1: Speeding up text-to-image gen- eration.https://blackforestlabs.ai, 2025

    Black Forest Labs. FLUX.1: Speeding up text-to-image gen- eration.https://blackforestlabs.ai, 2025. Ac- cessed: 2025-11-26. 4

  9. [9]

    Large scale gan training for high fi- delity natural image synthesis

    Andrew Brock et al. Large scale gan training for high fi- delity natural image synthesis. InInternational Conference on Learning Representations, 2018. 1

  10. [10]

    What makes fake images detectable? understanding prop- erties that generalize

    Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding prop- erties that generalize. InEuropean Conference on Computer Vision, 2020. 7

  11. [11]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

    Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. InForty- first International Conference on Machine Learning, 2024. 3, 6, 2, 5

  12. [12]

    Dual data alignment makes ai- generated image detector easier generalizable

    Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taip- ing Yao, and Shouhong Ding. Dual data alignment makes ai- generated image detector easier generalizable. InAdvances in Neural Information Processing Systems, 2025. 2, 3

  13. [13]

    Co-spy: Combining seman- tic and pixel features to detect synthetic images by ai

    Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining seman- tic and pixel features to detect synthetic images by ai. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13455–13465, 2025. 3, 6, 7, 2, 5

  14. [14]

    Stargan: Unified generative adversarial networks for multi-domain image-to-image translation

    Yunjey Choi et al. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018. 1

  15. [15]

    Scalable high-resolution pixel-space image syn- thesis with hourglass diffusion transformers

    Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image syn- thesis with hourglass diffusion transformers. InProceedings of the 41st International Conference on Machine Learning, pages 9550–9575. PMLR, 2024. 1

  16. [16]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 3, 1

  17. [17]

    Diffusion models beat gans on im- age synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021

    Prafulla Dhariwal et al. Diffusion models beat gans on im- age synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 1

  18. [18]

    An image is worth 16x16 words: Transformers for image recognition at scale.ICLR, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.ICLR, 2021. 6

  19. [19]

    R. A. Fisher. The use of multiple measurements in taxonomic problems.Annals of Eugenics, 7(2):179–188, 1936. 4

  20. [20]

    Nano banana.https://www.nano- banana.com/, 2025

    Google, Inc. Nano banana.https://www.nano- banana.com/, 2025. Accessed: 2025-08-29. 4

  21. [21]

    Vec- tor quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 10696–10706, 2022. 1

  22. [22]

    A bias-free training paradigm for more general ai-generated image de- tection

    Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image de- tection. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 18685–18694, 2025. 3, 6, 2, 5

  23. [23]

    Tracing hyperparameter dependencies for model parsing via learnable graph pooling network

    Xiao Guo, Vishal Asnani, Sijia Liu, and Xiaoming Liu. Tracing hyperparameter dependencies for model parsing via learnable graph pooling network. InAdvances in Neural In- formation Processing Systems, pages 116899–116932. Cur- ran Associates, Inc., 2024. 3

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7

  25. [25]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations, 2022. 2, 5

  26. [26]

    Bihpf: Bilateral high-pass filters for robust deepfake detection

    Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. Bihpf: Bilateral high-pass filters for robust deepfake detection. In2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2878–2887, 2022. 2

  27. [27]

    Enhancing adversarial robustness of vision- language models through low-rank adaptation

    Yuheng Ji, Yue Liu, Zhicheng Zhang, Zhao Zhang, Yuting Zhao, Xiaoshuai Hao, Gang Zhou, Xingwei Zhang, and Xi- aolong Zheng. Enhancing adversarial robustness of vision- language models through low-rank adaptation. InProceed- ings of the 2025 International Conference on Multimedia Re- trieval, pages 550–559, 2025. 4

  28. [28]

    Mathsticks: A benchmark for visual symbolic compositional reasoning with matchstick puzzles.arXiv preprint arXiv:2510.00483, 2025

    Yuheng Ji, Huajie Tan, Cheng Chi, Yijie Xu, Yuting Zhao, Enshen Zhou, Huaihai Lyu, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, et al. Mathsticks: A benchmark for visual symbolic compositional reasoning with matchstick puzzles.arXiv preprint arXiv:2510.00483, 2025. 4

  29. [29]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 1724–1734, 2025. 4

  30. [30]

    Visualtrans: A benchmark for real-world visual transformation reasoning

    Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, and Xiaolong Zheng. Visualtrans: A benchmark for real-world visual transformation reasoning. arXiv preprint arXiv:2508.04043, 2025. 4

  31. [31]

    Progressive growing of gans for improved quality, stability, and variation

    Tero Karras et al. Progressive growing of gans for improved quality, stability, and variation. InInternational Conference on Learning Representations, 2018. 4, 6, 1

  32. [32]

    A style-based generator architecture for generative adversarial networks

    Tero Karras et al. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019. 1

  33. [33]

    Kvikontent-midjourney v6

    Kvikontent. Kvikontent-midjourney v6. https://huggingface.co/Kvikontent/midjourney-v6, 2023. 1

  34. [34]

    Improving synthetic image detection towards generalization: An image transformation perspec- tive

    Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspec- tive. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 2405– 2414, 2025. 2, 6, 5

  35. [35]

    Ferretnet: Efficient synthetic image detection via local pixel dependencies, 2025

    Shuqiao Liang, Jian Liu, Renzhang Chen, and Quanlong Guan. Ferretnet: Efficient synthetic image detection via local pixel dependencies, 2025. 2

  36. [36]

    Forgery-aware adaptive transformer for generalizable synthetic image detection

    Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 10770–10780, 2024. 2, 3

  37. [37]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 7

  38. [38]

    A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2022

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2022. 7

  39. [39]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. Sub- mitted November 14, 2017; revised January 4, 2019. 2

  40. [40]

    LCM-LoRA: A universal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556,

    Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolin´ario Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556, 2023. 1

  41. [41]

    Lareˆ2: Latent reconstruction error based method for diffusion-generated image detection

    Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. Lareˆ2: Latent reconstruction error based method for diffusion-generated image detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17006–17015, 2024. 2

  42. [42]

    Egoprompt: Prompt pool learning for egocentric action recognition.arXiv preprint arXiv:2508.03266, 2025

    Huaihai Lyu, Chaofan Chen, Yuheng Ji, and Changsheng Xu. Egoprompt: Prompt pool learning for egocentric action recognition.arXiv preprint arXiv:2508.03266, 2025. 4

  43. [43]

    Midjourney v6.https : / / www

    Midjourney, Inc. Midjourney v6.https : / / www . midjourney.com, 2025. AI model version 6.0, Ac- cessed: 2025-11-26. 4

  44. [44]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 1

  45. [45]

    Towards uni- versal fake image detectors that generalize across genera- tive models

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across genera- tive models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480– 24489, 2023. 1, 2, 3, 6, 7, 5

  46. [46]

    Community forensics: Using thousands of generators to train fake image detectors

    Jeongsoo Park and Andrew Owens. Community forensics: Using thousands of generators to train fake image detectors. InProceedings of the Computer Vision and Pattern Recog- nition Conference (CVPR), pages 8245–8257, 2025. 1, 3, 6, 5

  47. [47]

    Semantic image synthesis with spatially- adaptive normalization

    Taesung Park et al. Semantic image synthesis with spatially- adaptive normalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019. 1

  48. [48]

    W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models

    Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models. InThe Twelfth International Conference on Learn- ing Representations, 2023. 1

  49. [49]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 6, 7

  50. [50]

    Aligned datasets improve detection of latent diffusion-generated images

    Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. In The Thirteenth International Conference on Learning Representations, 2025. 2

  51. [51]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 1, 4, 6, 2

  52. [52]

    Faceforensics++: Learning to detect manipulated facial images

    Andreas Rossler et al. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11,

  53. [53]

    Kandinsky 2.2

    Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey Kuznetsov, and Denis Dimitrov. Kandinsky 2.2. https://github.com/ai-forever/Kandinsky-2, 2023. 1

  54. [54]

    Maniplvm-r1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models

    Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, et al. Maniplvm-r1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models. arXiv preprint arXiv:2505.16517, 2025. 4

  55. [55]

    Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024. 2

  56. [56]

    Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024. 2, 6, 7, 1, 5

  57. [57]

    C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection

    Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7184–7192, 2025. 2, 3

  58. [58]

    Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. 4

  59. [59]

    Roboos-next: A unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration

    Huajie Tan, Cheng Chi, Xiansheng Chen, Yuheng Ji, Zhongxia Zhao, Xiaoshuai Hao, Yaoxu Lyu, Mingyu Cao, Junkai Zhao, Huaihai Lyu, et al. Roboos-next: A unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration. arXiv preprint arXiv:2510.26536,

  60. [60]

    Df-gan: A simple and effective baseline for text-to-image synthesis

    Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16515–16525, 2022. 1

  61. [61]

    Galip: Generative adversarial clips for text-to-image synthesis

    Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Galip: Generative adversarial clips for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14214–14223,

  62. [62]

    Robobrain 2.0 technical report

    BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025. 4

  63. [63]

    Decidiffusion 2.0

    DeciAI Research Team. Decidiffusion 2.0, 2024. 1

  64. [64]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc.,

  65. [65]

    Cnn-generated images are surprisingly easy to spot... for now

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8695–8704, 2020. 2, 3, 4, 6, 1, 5

  66. [66]

    Towards cross-view point correspondence in vision-language models

    Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, and Xiaolong Zheng. Towards cross-view point correspondence in vision-language models. arXiv preprint arXiv:2512.04686, 2025. 4

  67. [67]

    Dire for diffusion-generated image detection

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023. 2

  68. [68]

    Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation

    Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025. 2, 3

  69. [69]

    Qwen-image technical report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

  70. [70]

    A sanity check for ai-generated image detection

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. arXiv preprint arXiv:2406.19435, 2024. 2, 3, 6, 7, 1, 5

  71. [71]

    Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. arXiv preprint arXiv:2411.15633, 2024. 3, 7

  72. [72]

    D3: Scaling up deepfake detection by learning from discrepancy

    Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D3: Scaling up deepfake detection by learning from discrepancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 3, 6, 7, 2, 5

  73. [73]

    Towards universal ai-generated image detection by variational information bottleneck network

    Haifeng Zhang, Qinghui He, Xiuli Bi, Weisheng Li, Bo Liu, and Bin Xiao. Towards universal ai-generated image detection by variational information bottleneck network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23828–23837, 2025. 3

  74. [74]

    Unpaired image-to-image translation using cycle-consistent adversarial networks

    Jun-Yan Zhu et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017. 1

  75. [75]

    Genimage: A million-scale benchmark for detecting ai-generated image

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36:77771–77782, 2023. 4

  76. [76]

    Genimage: A million-scale benchmark for detecting ai-generated image

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36, 2024. 3, 6, 1

    Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes
    Supplementary Material

    We...

  77. [77]

    We randomly sample 2 images from each generator in it, which yields about 9,000 generated images

    for collecting, which is the same as our training dataset. We randomly sample 2 images from each generator in it, which yields about 9,000 generated images. Then we randomly sample 8,000 real images as before to construct the last dataset in the series.

    group | n_g | Generator(s)
    1 | 1 | SDv1.4
    2 | 2 | SDv1.4, BigGAN
    3 | 4 | SDv1.4, BigGAN, VQDM, Glide
    4 | 8 | All GenImage...

  78. [78]

    GenImage [76] provides a dataset trained on ImageNet-1k

    diffusion model. GenImage [76] provides a dataset trained on ImageNet-1k. It covers 8 generative models spanning GANs, diffusion models, and commercial APIs, including BigGAN [9], VQDM [21], Stable Diffusion versions, Wukong [2], ADM [17], and Midjourney [1]. SynthBuster [7] provides an aligned dataset, where real images and generated images are all in PNG format, wh...

  79. [79]

    Benefit then Conflict

    rethink the up-sampling operations in most generative architectures and detect them via an interpolation pattern. UniFD [45] leverages the image encoder of CLIP for feature extraction and classifies the resulting image embeddings with a simple KNN or linear layer. SAFE [34] extracts the high-frequency band as an artifact, with various data augmentations, to build a CNN clas...