Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes

Renshuai Tao; Xiaolong Zheng; Yipu Wang; Yuheng Ji; Yuxuan Tian; Yuyang Liu; Ziheng Qin

arxiv: 2512.12982 · v2 · submitted 2025-12-15 · 💻 cs.CV

Ziheng Qin, Yuheng Ji, Renshuai Tao, Yuxuan Tian, Yuyang Liu, Yipu Wang, Xiaolong Zheng

Pith reviewed 2026-05-16 22:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated image detection · prototype learning · generalization · GAN · diffusion models · Low-Rank Adaptation · forgery detection

The pith

Learning a compact set of canonical forgery prototypes overcomes the Benefit then Conflict dilemma and sustains high detection accuracy as generator diversity grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that adding more generators to training data for AI-generated image detectors first boosts then harms performance because feature distributions of real and fake images overlap more and fixed pretrained encoders cannot keep up. It introduces Generator-Aware Prototype Learning to learn a small number of canonical forgery prototypes that pull synthetic images into a unified low-variance space. A two-stage training process with Low-Rank Adaptation then adapts the encoder while retaining useful pretrained features. Experiments across many GAN and diffusion generators show the resulting detector maintains superior accuracy without needing per-generator tuning.

Core claim

GAPL learns a compact set of canonical forgery prototypes to create a unified low-variance feature space that counters data heterogeneity and employs a two-stage training scheme with Low-Rank Adaptation to enhance discriminative power while preserving pretrained knowledge, establishing a more robust decision boundary that generalizes to unseen generators.

What carries the argument

Generator-Aware Prototype Learning that constrains representations around learned canonical forgery prototypes and applies two-stage Low-Rank Adaptation training.
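
To make the Low-Rank Adaptation component concrete, the sketch below shows the generic LoRA update on a single linear layer: the pretrained weight `W` stays frozen and only two small factors `A` and `B` are trained. This is the standard LoRA formulation, not code from the GAPL repository; the function name `lora_forward`, the layer sizes, `rank`, and `alpha` are all illustrative assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Apply a frozen linear layer W plus a low-rank update (alpha / r) * B @ A."""
    rank = A.shape[0]
    delta = (alpha / rank) * (B @ A)   # low-rank update with the same shape as W
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 6, 2               # illustrative sizes (assumptions)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable factor, small random init
B = np.zeros((d_out, rank))               # trainable factor, zero init
x = rng.normal(size=(4, d_in))

base_out = x @ W.T                        # output of the untouched layer
lora_out = lora_forward(x, W, A, B)       # output with the (still zero) update
trainable = A.size + B.size               # parameters that would be trained
full = W.size                             # parameters in the frozen weight
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained output exactly, which is one reason LoRA is commonly described as preserving pretrained knowledge at the start of fine-tuning.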

If this is right

Detection accuracy remains stable rather than declining as the number of training generators increases.
Feature overlap between real and synthetic images decreases enough to support a single shared decision boundary.
Pretrained encoders can be efficiently adapted to new forgery types without full retraining or loss of prior knowledge.
The same framework works for both GAN and diffusion generators without separate tuning steps.
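
One hedged way to check the feature-overlap prediction is a Fisher-style separability score on embeddings: the squared distance between class means divided by the summed within-class variance. The function below is a generic illustration and not necessarily the metric used in the paper; `fisher_ratio` and the synthetic Gaussian features are assumptions made for this example.

```python
import numpy as np

def fisher_ratio(real, fake):
    """Squared distance between class means over summed within-class variance."""
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    between = np.sum((mu_r - mu_f) ** 2)
    within = real.var(axis=0).sum() + fake.var(axis=0).sum()
    return between / within

rng = np.random.default_rng(0)
# Synthetic stand-ins for encoder features; not data from the paper.
real = rng.normal(0.0, 1.0, size=(500, 16))
fake_tight = rng.normal(1.0, 1.0, size=(500, 16))   # low-variance fake features
fake_spread = rng.normal(1.0, 4.0, size=(500, 16))  # heterogeneous fake features
score_tight = fisher_ratio(real, fake_tight)
score_spread = fisher_ratio(real, fake_spread)
```

With the same mean shift, the high-variance fake set yields the lower score, which mirrors the claim that heterogeneity increases overlap between real and synthetic features.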

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prototype constraint could be applied to other heterogeneous detection problems such as deepfake video or audio where source variety also causes feature drift.
If the prototypes capture generator-independent forgery signals, future detectors might require far fewer labeled examples from each new generator.
Explicit modeling of generator differences through prototypes offers an alternative to simply scaling dataset size for better generalization.

Load-bearing premise

That a compact set of learned canonical forgery prototypes can sufficiently reduce data-level heterogeneity and create a unified low-variance feature space that generalizes to unseen generators without post-hoc tuning.

What would settle it

A measurable drop in accuracy when the training set is expanded with additional generators beyond those used in the reported experiments or when the detector is tested on a generator whose forgery patterns are absent from the learned prototypes.
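
A check of this kind can be sketched as a small helper that looks for a rise-then-fall shape in an accuracy curve ordered by the number of training generators. The helper name `peak_then_drop`, the `tolerance` parameter, and the example curves are hypothetical; the numbers are placeholders and not results from the paper.

```python
def peak_then_drop(accuracies, tolerance=0.0):
    """Return (peak_index, dropped) for an accuracy curve ordered by the
    number of training generators. `dropped` is True when the final value
    falls more than `tolerance` below the peak."""
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    peak_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
    dropped = accuracies[peak_index] - accuracies[-1] > tolerance
    return peak_index, dropped

# Placeholder curves for illustration only; these are not measured results.
rise_then_fall = [0.80, 0.88, 0.91, 0.89, 0.85]
monotone = [0.80, 0.85, 0.90, 0.92, 0.93]

result_conflict = peak_then_drop(rise_then_fall)
result_stable = peak_then_drop(monotone)
```

A curve whose peak sits before the last entry and whose final value falls below it would count against the stability claim; a tolerance above zero keeps run-to-run noise from being read as a drop.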

Figures

Figures reproduced from arXiv: 2512.12982 by Renshuai Tao, Xiaolong Zheng, Yipu Wang, Yuheng Ji, Yuxuan Tian, Yuyang Liu, Ziheng Qin.

**Figure 1.** Illustration of the “benefit then conflict” phenomenon we identify. As more generators are added to train the detector, it first benefit…

**Figure 2.** t-SNE visualization of CLIP embeddings. (a) Comparison between images generated by a single generator and real images. (b) Comparison between images generated by thousands of diverse generators and real images.

**Figure 3.** Results of the experiment on a series of datasets com…

**Figure 4.** The overall framework of our GAPL. We train GAPL in two stages. In the first stage, we train an MLP and extract embeddings for PCA decomposition, retaining only the top N components as prototypes; in the second stage, we use LoRA to fine-tune the encoder and map image embeddings to the prototypes to get the final logits.

… process can be formulated as follows: f = ϕ(x), ŷ = MLP(Normalize(f)), (7) where ϕ(·) is th…

**Figure 5.** Ablation study regarding the strategy of extracting proto…

**Figure 7.** Average attention scores on the validation set. Columns in red and green denote prototypes extracted from real images and generated images, respectively. We collected images that exhibit a high attention score for a specific prototype and found that they share common visual features.

**Figure 8.** Examples of test subsets. We visualize some datasets that are in-domain with respect to our training set, along with some out-of-distribution sets.

**Figure 9.** Self-attention maps of the original CLIP backbone and our fine-tuned backbone. There are 24 ViT blocks in the image encoder; we plot 8 blocks in each row, with indices increasing from left to right. For clarity of visualization, we use bicubic interpolation between image patches. In the shallow layers, we preserve most semantic features; in deep layers, our attention includes a wider range compared to origi…

**Figure 10.** Self-attention maps of the original CLIP backbone and our fine-tuned backbone. There are 24 ViT blocks in the image encoder; we plot 8 blocks in each row, with indices increasing from left to right. For clarity of visualization, we use bicubic interpolation between image patches. In the shallow layers, we preserve most semantic features; in deep layers, our attention includes a wider range compared to orig…
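
The Figure 4 caption describes prototypes as the top N components of a PCA decomposition over embeddings f = ϕ(x), and the Figure 7 caption refers to attention scores over prototypes drawn from real and generated images. The sketch below illustrates those two ideas with NumPy. It is not the released GAPL code: the function names, the choice of 4 prototypes per class, the softmax with a `temperature`, and the reading of Normalize as L2 normalization are all assumptions, and random arrays stand in for encoder outputs.

```python
import numpy as np

def pca_prototypes(embeddings, n_prototypes):
    """Return the top principal directions of the embeddings as unit-norm rows."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_prototypes]

def prototype_attention(f, prototypes, temperature=1.0):
    """Softmax over similarities between L2-normalized embeddings and prototypes."""
    f = f / np.linalg.norm(f, axis=-1, keepdims=True)
    scores = f @ prototypes.T / temperature
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
real_emb = rng.normal(size=(200, 32))  # stand-in for phi(x) on real images
fake_emb = rng.normal(size=(200, 32))  # stand-in for phi(x) on generated images

# Assumed layout: first 4 rows from real images, last 4 from generated images.
prototypes = np.concatenate(
    [pca_prototypes(real_emb, 4), pca_prototypes(fake_emb, 4)]
)
attn = prototype_attention(fake_emb[:5], prototypes)
```

Each row of `attn` sums to one, so averaging rows over a validation set would give per-prototype scores of the kind the Figure 7 caption describes. How these scores are turned into final logits is not recoverable from the captions and is left out here.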

Original abstract

The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrains representations with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity. To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at https://github.com/UltraCapture/GAPL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GAPL diagnoses the scaling drop-off in AIGI detection and uses prototypes plus two-stage LoRA to push accuracy higher, but the gains on unseen generators rest on an assumption that still needs tighter checks.

The main point is that this paper names a practical problem: detector performance rises then falls as you pull in more generators, which they label the Benefit then Conflict dilemma. They trace it to growing overlap in real and fake feature distributions plus the limits of a frozen pretrained backbone. Their Generator-Aware Prototype Learning tries to fix both by learning a small set of canonical forgery prototypes that pull the space into lower variance, then running a two-stage LoRA adaptation so the encoder can adjust without losing its original knowledge. The experiments report better numbers across GAN and diffusion sources, and the code is public, which helps anyone who wants to test it directly. That combination of diagnosis and a structured fix is the clearest new piece here.

The soft spot sits in the generalization claim. The prototypes are fit on the training distribution, so they could end up encoding cues that are specific to the generators seen during training rather than truly invariant artifacts. The abstract gives no breakdown of data splits, no mention of statistical tests, and no ablation on how many prototypes are needed or how sensitive the results are to that choice. Until those details are shown, it is hard to know whether the reported edge survives on generators that were never in the training mix.

This work is aimed at people who build or deploy AIGI detectors for moderation or forensics. Readers who already work with prototype methods or parameter-efficient adaptation will see a concrete application they can build on. It deserves peer review because the core framing is grounded in observed behavior and the method is reproducible enough to check. Referees should press for clearer held-out generator protocols and sensitivity results before the claims are taken as settled.

Referee Report

3 major / 2 minor

Summary. The paper identifies a 'Benefit then Conflict' dilemma in scaling AI-generated image detectors, where performance stagnates or degrades with added generator diversity due to data-level heterogeneity (overlapping real/synthetic features) and model-level bottlenecks from fixed pretrained encoders. It proposes Generator-Aware Prototype Learning (GAPL), which learns a compact set of canonical forgery prototypes to enforce a unified low-variance feature space and employs a two-stage LoRA-based adaptation scheme to enhance discriminability while retaining pretrained knowledge, claiming SOTA detection accuracy across GAN and diffusion generators with released code.

Significance. If the generalization claims hold under rigorous controls, GAPL offers a practical framework for handling increasing generator heterogeneity without full retraining, potentially advancing universal AIGI detectors. The two-stage adaptation and prototype constraint are conceptually sound for preserving knowledge while reducing variance; code release aids reproducibility, though the empirical nature (no parameter-free derivations) limits theoretical impact.

major comments (3)

[Experiments] Experiments section: The SOTA claims lack any description of data splits (e.g., how training generators are partitioned from test generators), statistical significance tests, or explicit controls for generator overlap; this directly undermines verification of the central generalization claim that prototypes transfer to unseen models without post-hoc tuning.
[Method] Method (prototype learning): The core premise that a small learned prototype set creates a 'unified, low-variance feature space' generalizing beyond training generators is not supported by ablation on prototype count or analysis showing invariance vs. training-specific artifacts (e.g., frequency patterns); joint optimization with the classifier risks encoding dataset statistics rather than canonical forgeries.
[Results] Results: No tables or figures report variance across multiple runs or controls for LoRA rank sensitivity, making it impossible to assess whether reported accuracy gains are robust or depend on unstated hyperparameter choices.

minor comments (2)

[Abstract] Abstract: The 'Benefit then Conflict dilemma' is presented as a novel diagnosis but would benefit from a brief citation or explicit contrast to prior heterogeneity analyses in AIGI literature.
[Method] Notation: Prototype and LoRA parameters are introduced without a clear summary table of free parameters (number of prototypes, rank), which would aid clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested details on experimental protocols, prototype validation, and result robustness.

Referee: [Experiments] Experiments section: The SOTA claims lack any description of data splits (e.g., how training generators are partitioned from test generators), statistical significance tests, or explicit controls for generator overlap; this directly undermines verification of the central generalization claim that prototypes transfer to unseen models without post-hoc tuning.

Authors: We agree that explicit documentation of the data partitioning is required to verify the generalization claims. In the revised manuscript we will add a dedicated subsection describing the generator splits, confirming that training and test generators are fully disjoint with no overlap. We will also report statistical significance via paired t-tests across runs and include explicit controls (e.g., a table listing training versus held-out generators) to demonstrate that prototypes are evaluated on completely unseen models.

revision: yes

Referee: [Method] Method (prototype learning): The core premise that a small learned prototype set creates a 'unified, low-variance feature space' generalizing beyond training generators is not supported by ablation on prototype count or analysis showing invariance vs. training-specific artifacts (e.g., frequency patterns); joint optimization with the classifier risks encoding dataset statistics rather than canonical forgeries.

Authors: We acknowledge the need for stronger empirical support of the prototype mechanism. We will add an ablation study varying prototype count (e.g., 1–16) and report both accuracy and feature-space variance metrics. To address invariance, we will include comparative analyses of frequency spectra and intra-class variance on training versus unseen generators, showing that the learned prototypes reduce overlap without encoding generator-specific artifacts. We will further clarify that training is two-stage: the prototypes are extracted by PCA in the first stage, and the encoder is fine-tuned with LoRA only in the second stage, which limits the risk of merely memorizing dataset statistics.

revision: yes

Referee: [Results] Results: No tables or figures report variance across multiple runs or controls for LoRA rank sensitivity, making it impossible to assess whether reported accuracy gains are robust or depend on unstated hyperparameter choices.

Authors: We agree that variance reporting and hyperparameter sensitivity are essential. The revised results section will include mean and standard deviation computed over five independent runs with different random seeds for all main tables. We will also add a sensitivity table for LoRA rank (ranks 4, 8, 16, 32), demonstrating that performance remains stable and that the reported gains are not artifacts of a single hyperparameter setting.

revision: yes
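
The run-level reporting promised in these responses (mean and standard deviation over seeds, plus a paired t-test) can be sketched with the Python standard library alone. The helper names `summarize` and `paired_t_statistic` are hypothetical, and the accuracy lists are placeholders for illustration; they are not results from the paper.

```python
import math
import statistics

def summarize(runs):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(runs), statistics.stdev(runs)

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder accuracies for five seeds; NOT results from the paper.
method_a = [0.91, 0.92, 0.90, 0.93, 0.91]
method_b = [0.89, 0.90, 0.89, 0.90, 0.90]

mean_a, std_a = summarize(method_a)
t_stat, dof = paired_t_statistic(method_a, method_b)
```

The pairing assumes both methods were run with the same seeds and evaluated on the same test sets; the t statistic would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value.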

Circularity Check

0 steps flagged

No significant circularity; empirical framework validated by external experiments

full rationale

The paper presents GAPL as a practical two-stage training method that learns forgery prototypes and applies LoRA adaptation to a pretrained encoder. No equations, derivations, or self-citation chains are provided in the manuscript that reduce any claimed prediction or unified feature space back to fitted training quantities by construction. The central claims rest on reported experimental accuracy across held-out GAN and diffusion generators rather than on any self-definitional loop or imported uniqueness theorem. This is the standard case of an empirical detector whose generalization is tested externally and therefore receives a zero circularity score.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method rests on the assumption that pretrained encoders hold transferable knowledge worth preserving and that a small number of prototypes can capture forgery patterns across heterogeneous generators.

free parameters (2)

number of prototypes
Compact set size chosen to balance coverage of forgery patterns against variance reduction; value not specified in abstract.
LoRA rank and adaptation parameters
Hyperparameters controlling the capacity of the two-stage adaptation; selected to enhance discriminative power while retaining pretrained features.

axioms (2)

domain assumption Fixed pretrained encoders cannot adapt to rising complexity of diverse generators without modification.
Invoked to justify the model-level bottleneck diagnosis and the need for LoRA.
domain assumption A structured prototype-based constraint can create a unified low-variance feature space.
Central to the data-heterogeneity solution in the proposed framework.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild
cs.CV 2026-04 unverdicted novelty 4.0

HEDGE is a heterogeneous ensemble using progressive DINOv3 training, multi-scale features, and MetaCLIP2 diversity with dual-gating fusion to achieve robust AI-generated image detection and 4th place in the NTIRE 2026...

Robust Deepfake Detection, NTIRE 2026 Challenge: Report
cs.CV 2026-04 unverdicted novelty 2.0

The NTIRE 2026 challenge finds that large foundation models combined with ensembles and degradation-aware training produce the most robust deepfake detectors.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 2 Pith papers · 3 internal anchors

[1]

Midjourney.Inhttps://www.midjourney.com/ home/, 2022. 4, 6, 1

work page 2022
[2]

5.Inhttps://xihe.mindspore.cn/ modelzoo/wukong, 2022

Wukong, 2022. 5.Inhttps://xihe.mindspore.cn/ modelzoo/wukong, 2022. 5. 1

work page 2022
[3]

Adobe firefly.https://firefly.adobe

Adobe. Adobe firefly.https://firefly.adobe. com/, 2025. Accessed: 2025-11-04. 1

work page 2025
[4]

Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Re- verse engineering of generative models: Inferring model hy- perparameters from generated images.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15477– 15493, 2023. 3

work page 2023
[5]

Alleviating performance disparity in adversarial spatiotemporal graph learning under zero-inflated distribution

Songran Bai, Yuheng Ji, Yue Liu, Xingwei Zhang, Xiaolong Zheng, and Daniel Dajun Zeng. Alleviating performance disparity in adversarial spatiotemporal graph learning under zero-inflated distribution. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 11436–11444, 2025. 4

work page 2025
[6]

Towards a unified understanding of robot manipulation: A comprehensive survey.arXiv preprint arXiv:2510.10903, 2025

Shuanghao Bai, Wenxuan Song, Jiayi Chen, Yuheng Ji, Zhide Zhong, Jin Yang, Han Zhao, Wanqi Zhou, Wei Zhao, Zhe Li, et al. Towards a unified understanding of robot manipulation: A comprehensive survey.arXiv preprint arXiv:2510.10903, 2025. 4

work page arXiv 2025
[7]

Synthbuster: Towards detection of diffu- sion model generated images.IEEE Open Journal of Signal Processing, 5:1–9, 2024

Quentin Bammey. Synthbuster: Towards detection of diffu- sion model generated images.IEEE Open Journal of Signal Processing, 5:1–9, 2024. 6, 1

work page 2024
[8]

FLUX.1: Speeding up text-to-image gen- eration.https://blackforestlabs.ai, 2025

Black Forest Labs. FLUX.1: Speeding up text-to-image gen- eration.https://blackforestlabs.ai, 2025. Ac- cessed: 2025-11-26. 4

work page 2025
[9]

Large scale gan training for high fi- delity natural image synthesis

Andrew Brock et al. Large scale gan training for high fi- delity natural image synthesis. InInternational Conference on Learning Representations, 2018. 1

work page 2018
[10]

What makes fake images detectable? understanding prop- erties that generalize

Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding prop- erties that generalize. InEuropean Conference on Computer Vision, 2020. 7

work page 2020
[11]

Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. InForty- first International Conference on Machine Learning, 2024. 3, 6, 2, 5

work page 2024
[12]

Dual data alignment makes ai- generated image detector easier generalizable

Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taip- ing Yao, and Shouhong Ding. Dual data alignment makes ai- generated image detector easier generalizable. InAdvances in Neural Information Processing Systems, 2025. 2, 3

work page 2025
[13]

Co-spy: Combining seman- tic and pixel features to detect synthetic images by ai

Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining seman- tic and pixel features to detect synthetic images by ai. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13455–13465, 2025. 3, 6, 7, 2, 5

work page 2025
[14]

Stargan: Unified generative adversarial networks for multi-domain image-to-image translation

Yunjey Choi et al. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018. 1

work page 2018
[15]

Scalable high-resolution pixel-space image syn- thesis with hourglass diffusion transformers

Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image syn- thesis with hourglass diffusion transformers. InProceedings of the 41st International Conference on Machine Learning, pages 9550–9575. PMLR, 2024. 1

work page 2024
[16]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 3, 1

work page 2009
[17]

Diffusion models beat gans on im- age synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021

Prafulla Dhariwal et al. Diffusion models beat gans on im- age synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 1

work page 2021
[18]

An image is worth 16x16 words: Transformers for image recognition at scale.ICLR, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.ICLR, 2021. 6

work page 2021
[19]

R. A. Fisher. The use of multiple measurements in taxonomic problems.Annals of Eugenics, 7(2):179–188, 1936. 4

work page 1936
[20]

Nano banana.https://www.nano- banana.com/, 2025

Google, Inc. Nano banana.https://www.nano- banana.com/, 2025. Accessed: 2025-08-29. 4

work page 2025
[21]

Vec- tor quantized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 10696–10706, 2022. 1

work page 2022
[22]

A bias-free training paradigm for more general ai-generated image de- tection

Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image de- tection. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 18685–18694, 2025. 3, 6, 2, 5

work page 2025
[23]

Tracing hyperparameter dependencies for model parsing via learnable graph pooling network

Xiao Guo, Vishal Asnani, Sijia Liu, and Xiaoming Liu. Tracing hyperparameter dependencies for model parsing via learnable graph pooling network. InAdvances in Neural In- formation Processing Systems, pages 116899–116932. Cur- ran Associates, Inc., 2024. 3

work page 2024
[24]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7

work page 2016
[25]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations, 2022. 2, 5

work page 2022
[26]

Bihpf: Bilateral high-pass filters for robust deepfake detection

Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. Bihpf: Bilateral high-pass filters for robust deepfake detection. In2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2878–2887, 2022. 2

work page 2022
[27]

Enhancing adversarial robustness of vision- language models through low-rank adaptation

Yuheng Ji, Yue Liu, Zhicheng Zhang, Zhao Zhang, Yuting Zhao, Xiaoshuai Hao, Gang Zhou, Xingwei Zhang, and Xi- aolong Zheng. Enhancing adversarial robustness of vision- language models through low-rank adaptation. InProceed- ings of the 2025 International Conference on Multimedia Re- trieval, pages 550–559, 2025. 4

work page 2025
[28]

Mathsticks: A benchmark for visual symbolic compositional reasoning with matchstick puzzles.arXiv preprint arXiv:2510.00483, 2025

Yuheng Ji, Huajie Tan, Cheng Chi, Yijie Xu, Yuting Zhao, Enshen Zhou, Huaihai Lyu, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, et al. Mathsticks: A benchmark for visual symbolic compositional reasoning with matchstick puzzles.arXiv preprint arXiv:2510.00483, 2025. 4

work page arXiv 2025
[29]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 1724–1734, 2025. 4

work page 2025
[30]

Visualtrans: A benchmark for real-world visual transformation reasoning

Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, and Xiaolong Zheng. Visualtrans: A benchmark for real-world visual transformation reasoning. arXiv preprint arXiv:2508.04043, 2025. 4

work page arXiv 2025
[31]

Progressive growing of gans for improved quality, stability, and variation

Tero Karras et al. Progressive growing of gans for improved quality, stability, and variation. InInternational Conference on Learning Representations, 2018. 4, 6, 1

work page 2018
[32]

A style-based generator architecture for generative adversarial networks

Tero Karras et al. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019. 1

work page 2019
[33]

Kvikontent-midjourney v6

Kvikontent. Kvikontent-midjourney v6. https://huggingface.co/Kvikontent/midjourney-v6, 2023. 1

work page 2023
[34]

Improving synthetic image detection towards generalization: An image transformation perspec- tive

Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspec- tive. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 2405– 2414, 2025. 2, 6, 5

work page 2025
[35]

Ferretnet: Efficient synthetic image detection via local pixel dependencies, 2025

Shuqiao Liang, Jian Liu, Renzhang Chen, and Quanlong Guan. Ferretnet: Efficient synthetic image detection via local pixel dependencies, 2025. 2

work page 2025
[36]

Forgery-aware adaptive transformer for generalizable synthetic image detection

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 10770–10780, 2024. 2, 3

work page 2024
[37]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 7

work page 2021
[38]

A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2022

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2022. 7

work page 2022
[39]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. Sub- mitted November 14, 2017; revised January 4, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

LCM-LoRA: A universal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556,

Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolin´ario Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556, 2023. 1

work page arXiv 2023
[41] Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. Lare^2: Latent reconstruction error based method for diffusion-generated image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17006–17015, 2024.

[42] Huaihai Lyu, Chaofan Chen, Yuheng Ji, and Changsheng Xu. Egoprompt: Prompt pool learning for egocentric action recognition. arXiv preprint arXiv:2508.03266, 2025.

[43] Midjourney, Inc. Midjourney v6. https://www.midjourney.com, 2025. AI model version 6.0, Accessed: 2025-11-26.

[44] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

[45] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023.

[46] Jeongsoo Park and Andrew Owens. Community forensics: Using thousands of generators to train fake image detectors. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 8245–8257, 2025.

[47] Taesung Park et al. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.

[48] Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2023.

[49] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[50] Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. In The Thirteenth International Conference on Learning Representations, 2025.

[51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.

[52] Andreas Rossler et al. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11,

[53] Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey Kuznetsov, and Denis Dimitrov. Kandinsky 2.2. https://github.com/ai-forever/Kandinsky-2, 2023.

[54] Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, et al. Maniplvm-r1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models. arXiv preprint arXiv:2505.16517, 2025.

[55] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024.

[56] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024.

[57] Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7184–7192, 2025.

[58] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[59] Huajie Tan, Cheng Chi, Xiansheng Chen, Yuheng Ji, Zhongxia Zhao, Xiaoshuai Hao, Yaoxu Lyu, Mingyu Cao, Junkai Zhao, Huaihai Lyu, et al. Roboos-next: A unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration. arXiv preprint arXiv:2510.26536,

[60] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16515–16525, 2022.

[61] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Galip: Generative adversarial clips for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14214–14223,

[62] BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025.

[63] DeciAI Research Team. Decidiffusion 2.0, 2024.

[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc.,

[65] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020.

[66] Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, and Xiaolong Zheng. Towards cross-view point correspondence in vision-language models. arXiv preprint arXiv:2512.04686, 2025.

[67] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023.

[68] Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025.

[69] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

[70] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. arXiv preprint arXiv:2406.19435, 2024.

[71] Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. arXiv preprint arXiv:2411.15633, 2024.

[72] Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D3: Scaling up deepfake detection by learning from discrepancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[73] Haifeng Zhang, Qinghui He, Xiuli Bi, Weisheng Li, Bo Liu, and Bin Xiao. Towards universal ai-generated image detection by variational information bottleneck network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23828–23837, 2025.

[74] Jun-Yan Zhu et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

[75] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36:77771–77782, 2023.

[76] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36, 2024.

Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes

Supplementary Material

We...

for collecting, which is the same as our training dataset. We randomly sample 2 images from each generator in it, which yields about 9,000 generated images. Then we randomly sample 8,000 real images as before to construct the last dataset in the series.

group  n_g  Generator(s)
1      1    SDv1.4
2      2    SDv1.4, BigGAN
3      4    SDv1.4, BigGAN, VQDM, Glide
4      8    All GenImage...

diffusion model. GenImage [76] provides a dataset trained on ImageNet-1k. It covers 8 generative models across GANs, diffusion models, and commercial APIs, including BigGAN [9], VQDM [21], Stable Diffusion models, Wukong [2], ADM [17], and Midjourney [1]. SynthBuster [7] provides an aligned dataset, where real images and generated images are all in PNG format, wh...

rethink the up-sampling operation in most generative architectures and detect them via an interpolation pattern. UniFD [45] leverages the image encoder of CLIP for feature extraction; it takes image embeddings for classification with a simple KNN or linear layer. SAFE [34] extracts the high-frequency band as an artifact with various data augmentations to build a CNN clas...

[1] Midjourney. In https://www.midjourney.com/home/, 2022.

[2] Wukong, 2022. 5. In https://xihe.mindspore.cn/modelzoo/wukong, 2022. 5.

[3] Adobe. Adobe firefly. https://firefly.adobe.com/, 2025. Accessed: 2025-11-04.

[4] Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Reverse engineering of generative models: Inferring model hyperparameters from generated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15477–15493, 2023.

[5] Songran Bai, Yuheng Ji, Yue Liu, Xingwei Zhang, Xiaolong Zheng, and Daniel Dajun Zeng. Alleviating performance disparity in adversarial spatiotemporal graph learning under zero-inflated distribution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11436–11444, 2025.

[6] Shuanghao Bai, Wenxuan Song, Jiayi Chen, Yuheng Ji, Zhide Zhong, Jin Yang, Han Zhao, Wanqi Zhou, Wei Zhao, Zhe Li, et al. Towards a unified understanding of robot manipulation: A comprehensive survey. arXiv preprint arXiv:2510.10903, 2025.

[7] Quentin Bammey. Synthbuster: Towards detection of diffusion model generated images. IEEE Open Journal of Signal Processing, 5:1–9, 2024.

[8] Black Forest Labs. FLUX.1: Speeding up text-to-image generation. https://blackforestlabs.ai, 2025. Accessed: 2025-11-26.

[9] Andrew Brock et al. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.

[10] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding properties that generalize. In European Conference on Computer Vision, 2020.

[11] Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In Forty-first International Conference on Machine Learning, 2024.

[12] Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, and Shouhong Ding. Dual data alignment makes ai-generated image detector easier generalizable. In Advances in Neural Information Processing Systems, 2025.

[13] Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13455–13465, 2025.

[14] Yunjey Choi et al. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.

[15] Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Proceedings of the 41st International Conference on Machine Learning, pages 9550–9575. PMLR, 2024.

[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[17] Prafulla Dhariwal et al. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.

[19] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

[20] Google, Inc. Nano banana. https://www.nano-banana.com/, 2025. Accessed: 2025-08-29.

[21] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.

[22] Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 18685–18694, 2025.

[23] Xiao Guo, Vishal Asnani, Sijia Liu, and Xiaoming Liu. Tracing hyperparameter dependencies for model parsing via learnable graph pooling network. In Advances in Neural Information Processing Systems, pages 116899–116932. Curran Associates, Inc., 2024.

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[25] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

[26] Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. Bihpf: Bilateral high-pass filters for robust deepfake detection. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2878–2887, 2022.

[27] Yuheng Ji, Yue Liu, Zhicheng Zhang, Zhao Zhang, Yuting Zhao, Xiaoshuai Hao, Gang Zhou, Xingwei Zhang, and Xiaolong Zheng. Enhancing adversarial robustness of vision-language models through low-rank adaptation. In Proceedings of the 2025 International Conference on Multimedia Retrieval, pages 550–559, 2025.

[28] Yuheng Ji, Huajie Tan, Cheng Chi, Yijie Xu, Yuting Zhao, Enshen Zhou, Huaihai Lyu, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, et al. Mathsticks: A benchmark for visual symbolic compositional reasoning with matchstick puzzles. arXiv preprint arXiv:2510.00483, 2025.

[29] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025.

[30] Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, and Xiaolong Zheng. Visualtrans: A benchmark for real-world visual transformation reasoning. arXiv preprint arXiv:2508.04043, 2025.

[31] Tero Karras et al. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

[32] Tero Karras et al. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

[33] Kvikontent. Kvikontent-midjourney v6. https://huggingface.co/Kvikontent/midjourney-v6, 2023.

[34] Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 2405–2414, 2025.

[35] Shuqiao Liang, Jian Liu, Renzhang Chen, and Quanlong Guan. Ferretnet: Efficient synthetic image detection via local pixel dependencies, 2025.

[36] [36]

Forgery-aware adaptive transformer for generalizable synthetic image detection

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 10770–10780, 2024. 2, 3

work page 2024

[37] [37]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 7

work page 2021

[38] [38]

A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2022

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2022. 7

work page 2022

[39] [39]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. Sub- mitted November 14, 2017; revised January 4, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017

[40] [40]

LCM-LoRA: A universal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556,

Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolin´ario Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556, 2023. 1

work page arXiv 2023

[41] [41]

Lareˆ2: Latent reconstruction error based method for diffusion-generated image detection

Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. Lareˆ2: Latent reconstruction error based method for diffusion-generated image detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17006–17015, 2024. 2

work page 2024

[42] [42]

Egoprompt: Prompt pool learning for egocentric action recognition.arXiv preprint arXiv:2508.03266, 2025

Huaihai Lyu, Chaofan Chen, Yuheng Ji, and Changsheng Xu. Egoprompt: Prompt pool learning for egocentric action recognition.arXiv preprint arXiv:2508.03266, 2025. 4

work page arXiv 2025

[43] [43]

Midjourney v6.https : / / www

Midjourney, Inc. Midjourney v6.https : / / www . midjourney.com, 2025. AI model version 6.0, Ac- cessed: 2025-11-26. 4

work page 2025

[44] [44]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 1

work page internal anchor Pith review Pith/arXiv arXiv 2021

[45] [45]

Towards uni- versal fake image detectors that generalize across genera- tive models

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across genera- tive models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480– 24489, 2023. 1, 2, 3, 6, 7, 5

work page 2023

[46] [46]

Community forensics: Using thousands of generators to train fake image detectors

Jeongsoo Park and Andrew Owens. Community forensics: Using thousands of generators to train fake image detectors. InProceedings of the Computer Vision and Pattern Recog- nition Conference (CVPR), pages 8245–8257, 2025. 1, 3, 6, 5

work page 2025

[47] [47]

Semantic image synthesis with spatially- adaptive normalization

Taesung Park et al. Semantic image synthesis with spatially- adaptive normalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019. 1

work page 2019

[48] [48]

W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models

Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models. InThe Twelfth International Conference on Learn- ing Representations, 2023. 1

work page 2023

[49] [49]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 6, 7

work page 2021

[50] [50]

Aligned datasets improve detection of latent diffusion-generated images

Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. InThe Thirteenth Inter- national Conference on Learning Representations, 2025. 2

work page 2025

[51] [51]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 1, 4, 6, 2

work page 2022

[52] [52]

Faceforensics++: Learning to detect manipulated facial images

Andreas Rossler et al. Faceforensics++: Learning to detect manipulated facial images. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11,

work page

[53] [53]

Kandinsky 2.2.https: //github.com/ai-forever/Kandinsky-2, 2023

Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey Kuznetsov, and Denis Dimitrov. Kandinsky 2.2.https: //github.com/ai-forever/Kandinsky-2, 2023. 1

work page 2023

[54] [54]

Maniplvm-r1: Rein- forcement learning for reasoning in embodied manipula- tion with large vision-language models.arXiv preprint arXiv:2505.16517, 2025

Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, et al. Maniplvm-r1: Rein- forcement learning for reasoning in embodied manipula- tion with large vision-language models.arXiv preprint arXiv:2505.16517, 2025. 4

work page arXiv 2025

[55] [55]

Frequency-aware deepfake de- tection: Improving generalizability through frequency space domain learning

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake de- tection: Improving generalizability through frequency space domain learning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024. 2

work page 2024

[56] [56]

Rethinking the up-sampling op- erations in cnn-based generative network for generalizable deepfake detection

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling op- erations in cnn-based generative network for generalizable deepfake detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024. 2, 6, 7, 1, 5

work page 2024

[57] [57]

C2p-clip: Inject- ing category common prompt in clip to enhance generaliza- tion in deepfake detection

Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Inject- ing category common prompt in clip to enhance generaliza- tion in deepfake detection. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 7184–7192, 2025. 2, 3

work page 2025

[58] [58]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. 4

work page

[59] [59]

Roboos-next: A unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration. arXiv preprint arXiv:2510.26536,

Huajie Tan, Cheng Chi, Xiansheng Chen, Yuheng Ji, Zhongxia Zhao, Xiaoshuai Hao, Yaoxu Lyu, Mingyu Cao, Junkai Zhao, Huaihai Lyu, et al. Roboos-next: A unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration. arXiv preprint arXiv:2510.26536,

work page arXiv

[60] [60]

Df-gan: A simple and effective baseline for text-to-image synthesis

Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16515–16525, 2022. 1

work page 2022

[61] [61]

Galip: Generative adversarial clips for text-to-image synthesis

Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Galip: Generative adversarial clips for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14214–14223,

work page

[62] [62]

Robobrain 2.0 technical report

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025. 4

work page arXiv 2025

[63] [63]

Decidiffusion 2.0, 2024

DeciAI Research Team. Decidiffusion 2.0, 2024. 1

work page 2024

[64] [64]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc.,

work page

[65] [65]

Cnn-generated images are surprisingly easy to spot

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 2, 3, 4, 6, 1, 5

work page 2020

[66] [66]

Towards cross-view point correspondence in vision-language models. arXiv preprint arXiv:2512.04686, 2025

Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, and Xiaolong Zheng. Towards cross-view point correspondence in vision-language models. arXiv preprint arXiv:2512.04686, 2025. 4

work page arXiv 2025

[67] [67]

Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023

Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023. 2

work page arXiv 2023

[68] [68]

Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025

Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025. 2, 3

work page arXiv 2025

[69] [69]

Qwen-image technical report,

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

work page

[70] [70]

A sanity check for ai-generated image detection

Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. arXiv preprint arXiv:2406.19435, 2024. 2, 3, 6, 7, 1, 5

work page arXiv 2024

[71] [71]

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. arXiv preprint arXiv:2411.15633, 2024. 3, 7

work page internal anchor Pith review arXiv 2024

[72] [72]

D3: Scaling up deepfake detection by learning from discrepancy

Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D3: Scaling up deepfake detection by learning from discrepancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 3, 6, 7, 2, 5

work page 2025

[73] [73]

Towards universal ai-generated image detection by variational information bottleneck network

Haifeng Zhang, Qinghui He, Xiuli Bi, Weisheng Li, Bo Liu, and Bin Xiao. Towards universal ai-generated image detection by variational information bottleneck network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23828–23837, 2025. 3

work page 2025

[74] [74]

Unpaired image-to-image translation using cycle-consistent adversarial networks

Jun-Yan Zhu et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017. 1

work page 2017

[75] [75]

Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36:77771–77782, 2023

Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36:77771–77782, 2023. 4

work page 2023

[76] [76]

Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36, 2024

Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems, 36, 2024. 3, 6, 1

Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes

Supplementary Material

We...

work page 2024

[77] [77]

We randomly sample 2 images from each generator in it, which together amount to about 9,000 generated images

for collecting, which is the same as our training dataset. We randomly sample 2 images from each generator in it, which together amount to about 9,000 generated images. Then we randomly sample 8,000 real images as before to construct the last dataset in the series.

group | n_g | Generator(s)
1 | 1 | SDv1.4
2 | 2 | SDv1.4, BigGAN
3 | 4 | SDv1.4, BigGAN, VQDM, Glide
4 | 8 | All GenImage...

work page

[78] [78]

GenImage [76] provides a dataset trained on ImageNet-1k

diffusion model. GenImage [76] provides a dataset trained on ImageNet-1k. It covers 8 generative models spanning GANs, diffusion models, and commercial APIs, including BigGAN [9], VQDM [21], Stable Diffusion models, Wukong [2], ADM [17], and Midjourney [1]. SynthBuster [7] provides an aligned dataset, where real images and generated images are all in PNG format, wh...

work page

[79] [79]

Benefit then Conflict

rethink the up-sampling operation in most generative architectures and detect them via an interpolation pattern. UniFD [45] leverages the image encoder of CLIP for feature extraction; it takes image embeddings for classification with a simple KNN or linear layer. SAFE [34] extracts the high-frequency band as the artifact, with various data augmentations, to build a CNN clas...

work page