Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection

Jianqiang Zhao; Mei Qiu; Yanyun Qu

arXiv: 2604.04608 · v1 · submitted 2026-04-06 · 💻 cs.CV


Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated image detection · physical image features · synthetic detection · CLIP multimodal learning · feature selection · deepfake detection · universal descriptors · vision-language models


The pith

Five physical image features can distinguish AI-generated fakes from real images across diverse generative models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing deepfake detectors often overfit to particular generative models and lose effectiveness on new ones. This paper asks whether objective pixel-level physical properties can supply more stable discrimination between natural and synthetic images. The authors test 15 candidate features on more than 20 datasets spanning GANs and diffusion models, then use a feature-selection procedure to isolate five that remain consistently informative. These five descriptors are turned into text values and supplied to CLIP alongside semantic captions so that the model learns image-text alignments grounded in physical authenticity. The resulting detector reaches state-of-the-art accuracy on Genimage benchmarks, including 99.8 percent on Wukong and SDv1.4.

Core claim

The paper shows that a compact set of five physical features (Laplacian variance, Sobel statistics, residual noise variance, and two additional descriptors chosen by the selection algorithm) maintains strong discriminative power across every tested dataset and architecture. When these features are encoded as text and combined with semantic captions to steer CLIP’s representation learning, the multimodal model achieves near-perfect detection accuracy (99.8 percent on Wukong and SDv1.4) while reducing dependence on purely language-based cues.

What carries the argument

The novel feature-selection algorithm that extracts five stable physical descriptors (Laplacian variance, Sobel statistics, residual noise variance and two others) and converts them into text-encoded values for guiding CLIP’s image-text representation learning.
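To make the three named descriptors concrete, here is a minimal sketch in plain Python. The kernel choices (4-neighbour Laplacian, 3x3 Sobel) and the definition of the residual as "pixel minus 3x3 local mean" are assumptions made for illustration; the paper's exact definitions are not given on this page, and `physical_descriptors` and `filter3x3` are hypothetical names.

```python
from statistics import mean, pstdev, pvariance

# Textbook 3x3 kernels; assumed here, not confirmed as the paper's choices.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
ONES = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]


def filter3x3(gray, kernel):
    """Slide a 3x3 kernel over a 2D list of floats (no padding); return a flat list."""
    h, w = len(gray), len(gray[0])
    return [
        sum(kernel[a][b] * gray[i + a][j + b] for a in range(3) for b in range(3))
        for i in range(h - 2)
        for j in range(w - 2)
    ]


def physical_descriptors(gray):
    """Compute illustrative pixel-level statistics for a grayscale image."""
    lap = filter3x3(gray, LAPLACIAN)
    gx = filter3x3(gray, SOBEL_X)
    gy = filter3x3(gray, SOBEL_Y)
    magnitude = [(x * x + y * y) ** 0.5 for x, y in zip(gx, gy)]

    # Assumed residual: each interior pixel minus its 3x3 neighbourhood mean.
    local_sum = filter3x3(gray, ONES)
    inner_w = len(gray[0]) - 2
    residual = [
        gray[n // inner_w + 1][n % inner_w + 1] - s / 9.0
        for n, s in enumerate(local_sum)
    ]

    return {
        "laplacian_var": pvariance(lap),
        "sobel_mean": mean(magnitude),
        "sobel_std": pstdev(magnitude),
        "residual_noise_var": pvariance(residual),
    }
```

On a horizontal ramp image the Laplacian and residual variances vanish while the Sobel magnitude is constant, which is a quick sanity check that the filters behave as expected. A production version would more likely use an image library's filtering routines than nested lists.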

If this is right

The detector attains state-of-the-art performance on multiple Genimage benchmarks.
Accuracy reaches 99.8 percent on the Wukong and SDv1.4 datasets.
Physical features reduce overfitting to any single generative family.
Pixel-level authenticity signals improve the reliability of vision-language models.
The approach suggests a route toward mitigating hallucinations in multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the five features prove universal, they could be extracted directly from raw pixels without retraining for each new generator.
Similar physical descriptors might be developed for video or audio synthesis detection.
Encoding physical measurements as text could be tested in other vision-language architectures to measure gains in factual grounding.

Load-bearing premise

The five selected physical features will remain stable and discriminative for generative models and architectures beyond the more than twenty datasets examined in the study.
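One way to probe this premise is to measure, per dataset, how well a single feature separates real from fake images and then look at the worst case. The sketch below uses a rank-based AUC; for a single feature this matches the AUC of a logistic regression score up to direction, because the fitted score is monotone in the feature. The function names and the "worst case across datasets" summary are illustrative assumptions, not the paper's stability or discriminability scores.

```python
def feature_auc(real_values, fake_values):
    """Probability that a random fake image scores above a random real one.

    Ties count as half a win, as in the Mann-Whitney U statistic.
    """
    wins = 0.0
    for f in fake_values:
        for r in real_values:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(real_values) * len(fake_values))


def separability(real_values, fake_values):
    """Direction-free separability in [0.5, 1.0]."""
    auc = feature_auc(real_values, fake_values)
    return max(auc, 1.0 - auc)


def worst_case_separability(per_dataset):
    """per_dataset maps a dataset name to (real_values, fake_values)."""
    scores = {
        name: separability(real, fake) for name, (real, fake) in per_dataset.items()
    }
    return scores, min(scores.values())
```

A feature whose worst-case separability stays high on generators that were not part of the selection step would support the premise; one that collapses toward 0.5 on a new generator would count against it.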

What would settle it

Running the detector on images produced by a new generative architecture released after the experiments and observing whether accuracy falls well below the reported 99 percent level on the original benchmarks.

Figures

Figures reproduced from arXiv: 2604.04608 by Jianqiang Zhao, Mei Qiu, Yanyun Qu.

**Figure 1.** Overall workflow of the proposed synthetic image detection framework. (a) Enhanced caption preparation: Integrate input images’ core physical features into captions to enrich representation. (b) Training: Use enhanced captions and class prompts to train the model for real/AI-generated image discrimination. (c) Testing: Input images into the trained model to predict AI generation. The training and testing …

**Figure 2.** Stability (Ss, left) and discriminability (Sd, right) scores of image features for fake image detection. Based on the thresholds, features are labeled Core Feature (green), Usable Feature (orange), Unstable High-Discrim (red), or Unusable Feature (grey).

**Figure 3.** Density distributions of four core features (Laplacian variance, Sobel magnitude mean/std, LBP variance) over ADM/Midjourney (GenImage) and BigGAN/CRN/Glide (UniversalFakeDetect). Blue = real images, red = fake images. AUC is computed per feature using Logistic Regression. For detection, we adopt the CLIP ViT-L/14 with LoRA (following C2pClip [27]), replacing only the captions…

**Figure 4.** Image-text pairs with original texts generated by ClipCap and enhanced by core features’ description.

**Figure 5.** Quantitative evaluation of image-text similarity based on pre-trained Clip model.

**Figure 6.** Cosine similarity comparison between two caption styles across 7 datasets. Blue/orange: results using caption “The major features are:”. Green/red: results using caption “The physical features are:”

**Figure 7.** Image-text cosine similarity distributions of ADM

**Figure 8.** Image-text cosine similarity distributions of BigGAN

**Figure 9.** Image-text cosine similarity distributions of SDv1.5

**Figure 10.** Image-text cosine similarity distributions of VQDM

**Figure 11.** Image-text cosine similarity distributions of Wukong

**Figure 12.** Image-text cosine similarity distributions of Midjourney

**Figure 13.** Image-text cosine similarity distributions of Glide
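Figures 5 through 13 all report image-text cosine similarity. As a reminder of what that quantity is, the sketch below computes it for plain Python vectors and averages it over matched pairs. The embeddings themselves would come from a CLIP image and text encoder, which is not shown; `mean_similarity` is a hypothetical helper, and how the paper aggregates similarities is an assumption here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_similarity(image_embeddings, text_embeddings):
    """Average similarity over matched (image, caption) embedding pairs."""
    sims = [cosine_similarity(i, t) for i, t in zip(image_embeddings, text_embeddings)]
    return sum(sims) / len(sims)
```

Comparing `mean_similarity` for real and fake images under each caption style is one simple way to reproduce the kind of contrast the caption-style figure describes.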

Original abstract

The rapid advancement of AI-generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI-generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features, including Laplacian variance, Sobel statistics, and residual noise variance, that exhibit consistent discriminative power across all tested datasets. These features are then converted into text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel-level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision-language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper finds five low-level physical stats that separate real from fake images across 20+ current datasets and folds their text encodings into CLIP, but the selection process and long-term stability are not shown to be robust.


The core contribution is a search across 15 physical image measures on more than 20 GAN and diffusion datasets that settles on five descriptors—Laplacian variance, Sobel statistics, residual noise variance and two others—that appear to discriminate consistently. These are turned into text tokens and added to CLIP training alongside captions, with reported accuracies reaching 99.8% on some Genimage sets like Wukong and SDv1.4. The scale of the dataset sweep and the attempt to ground multimodal detection in pixel-level cues are the parts that actually move the needle beyond standard semantic-only approaches.

Referee Report

2 major / 2 minor

Summary. The paper claims that a comprehensive analysis of 15 physical features across more than 20 datasets from GANs and diffusion models yields a novel feature selection algorithm identifying five core descriptors (Laplacian variance, Sobel statistics, residual noise variance, and two unspecified others) with consistent discriminative power between real and synthetic images. These features are quantized into text tokens and fused with semantic captions inside a CLIP-based multimodal model, producing state-of-the-art detection results including 99.8% accuracy on benchmarks such as Wukong and SDv1.4.

Significance. If the selected physical descriptors prove architecture-independent and the multimodal integration preserves their signal without introducing new biases, the work could meaningfully advance robust synthetic-image detection by grounding it in low-level image physics rather than model-specific artifacts. The breadth of the multi-dataset evaluation across current generative families supplies partial empirical support for cross-architecture consistency and constitutes a clear strength.

major comments (2)

[§3.2] §3.2 (feature selection procedure): The novel feature selection algorithm is applied to the identical collection of more than 20 datasets later used for final evaluation. This creates a circularity risk in which the five retained features may be tuned to statistical differences observed in these specific collections rather than independently validated universal properties. The manuscript does not describe hold-out sets, cross-validation folds, or an independent validation cohort for the selection step itself.
[§4] §4 (experimental results): Reported accuracies (e.g., 99.8% on Wukong and SDv1.4) are presented without accompanying baseline comparisons against standard CLIP, other physical-feature detectors, or ablations that isolate the contribution of the five physical descriptors versus the semantic branch. No error bars, statistical significance tests, or dataset-bias diagnostics are supplied, rendering the claim of consistent superiority difficult to evaluate.
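A hold-out protocol of the kind the §3.2 comment asks for could be sketched as follows: split the dataset collections (not individual images) into a selection split and a hold-out split, pick features on the first, and check them on the second. The 30% hold-out fraction, the fixed seed, the score threshold, and both function names are assumptions for illustration.

```python
import random


def split_collections(names, holdout_fraction=0.3, seed=0):
    """Split dataset collection names into (selection, holdout) lists."""
    rng = random.Random(seed)
    shuffled = sorted(names)  # sort first so the split depends only on the seed
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def survives_holdout(feature_scores, holdout, threshold):
    """Keep features whose score clears the threshold on every hold-out collection.

    feature_scores maps feature name -> {collection name: score}.
    """
    return [
        feature
        for feature, scores in feature_scores.items()
        if all(scores[name] >= threshold for name in holdout)
    ]
```

Splitting at the level of whole collections matters: holding out images from a generator that also appears in the selection split would not test whether the features transfer to unseen generators.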

minor comments (2)

[Abstract] The abstract refers to 'five core physical features including Laplacian variance, Sobel statistics, and residual noise variance' but does not enumerate the remaining two; the main text should supply their exact definitions and the selection criteria (thresholds, ranking metric) at the first mention.
[§3.3] The conversion of continuous physical descriptors into discrete text tokens for CLIP is described only at a high level; a short appendix or subsection clarifying the quantization scheme and any information loss would improve reproducibility.
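For orientation, one simple text encoding of the kind the §3.3 comment asks the authors to document is fixed-precision rounding appended to the semantic caption. The sketch below is an assumption, not the paper's scheme: only the prefix "The physical features are:" appears on this page (in the Figure 6 caption), while the two-decimal rounding, the name formatting, and `enhance_caption` itself are hypothetical.

```python
def enhance_caption(caption, features, prefix="The physical features are:", decimals=2):
    """Append rounded feature values to a semantic caption.

    features maps a feature name to a float; rounding to `decimals` places is
    the only quantization applied, so the information loss per value is
    bounded by half a unit in the last printed decimal.
    """
    parts = [
        f"{name.replace('_', ' ')} {value:.{decimals}f}"
        for name, value in features.items()
    ]
    base = caption.rstrip(". ")
    return f"{base}. {prefix} {', '.join(parts)}."
```

For example, `enhance_caption("A dog on grass.", {"laplacian_var": 12.3456})` yields a caption ending in "laplacian var 12.35." Because CLIP's text encoder has a short fixed context length, the number of features and digits that fit alongside the caption is limited, which is one reason the exact scheme affects reproducibility.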

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, acknowledging valid concerns and outlining specific revisions to improve the manuscript's rigor.


Referee: [§3.2] §3.2 (feature selection procedure): The novel feature selection algorithm is applied to the identical collection of more than 20 datasets later used for final evaluation. This creates a circularity risk in which the five retained features may be tuned to statistical differences observed in these specific collections rather than independently validated universal properties. The manuscript does not describe hold-out sets, cross-validation folds, or an independent validation cohort for the selection step itself.

Authors: We acknowledge this concern about potential circularity in feature selection. Although the five features (Laplacian variance, Sobel statistics, residual noise variance, and the two additional descriptors detailed in §3.2) demonstrate consistent discriminative power across more than 20 datasets spanning multiple GAN and diffusion architectures, this does not fully eliminate the risk of dataset-specific tuning. In the revised manuscript, we will update §3.2 to describe a hold-out validation procedure: feature selection will be performed on 70% of the dataset collections, with the retained features then validated for consistency on the remaining 30% hold-out sets. Cross-validation folds and independent cohort metrics will be reported to confirm universality. revision: yes
Referee: [§4] §4 (experimental results): Reported accuracies (e.g., 99.8% on Wukong and SDv1.4) are presented without accompanying baseline comparisons against standard CLIP, other physical-feature detectors, or ablations that isolate the contribution of the five physical descriptors versus the semantic branch. No error bars, statistical significance tests, or dataset-bias diagnostics are supplied, rendering the claim of consistent superiority difficult to evaluate.

Authors: We agree that the experimental section requires additional controls and statistical support to substantiate the performance claims. The revised §4 will incorporate: (i) direct comparisons to standard CLIP and other physical-feature detectors, (ii) ablation studies separating the physical descriptor branch from the semantic captions, (iii) error bars computed over multiple random seeds, (iv) statistical significance tests (e.g., McNemar's test or paired t-tests), and (v) dataset-bias diagnostics including per-architecture breakdowns. These additions will allow clearer evaluation of the multimodal fusion's contribution. revision: yes
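Of the proposed additions, McNemar's test is the one most directly suited to comparing two detectors on the same test images, since it looks only at the images on which they disagree. A minimal sketch of the exact (binomial) two-sided version follows; the function name and the boolean-list input format are assumptions for illustration.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired per-image correctness flags."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the detectors never disagree
    k = min(only_a, only_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

With accuracies near 99.8%, the number of disagreeing images can be very small, so the exact form is safer than the chi-squared approximation.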

Circularity Check

0 steps flagged

No circularity detected in derivation chain


The paper conducts an empirical survey of 15 physical features over >20 datasets from existing GANs and diffusion models, applies a novel selection procedure to retain five that show consistent separation on those same datasets, encodes the selected values as text, and fuses them into CLIP for downstream detection. No equation, theorem, or load-bearing claim is shown to be definitionally equivalent to its own inputs; the selection step is an explicit algorithmic contribution whose output is then validated by reported accuracy numbers rather than presupposed. The generalization claim to future generators is an empirical hypothesis, not a self-referential derivation. Self-citations are not invoked to justify uniqueness or to close the argument.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical assumption that a small set of pixel-level physical measures can be selected to generalize across generative models, with the selection process itself likely depending on data-driven choices.

free parameters (1)

Feature selection criteria and thresholds
The novel algorithm that narrows 15 features to five core ones requires choices of scoring or cutoff values that are fitted or tuned on the tested datasets.
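To show where such cutoffs enter, here is a sketch of a two-threshold rule that sorts features into the four categories named in the Figure 2 caption. The mapping from scores to categories and the default thresholds are assumptions; the page does not state the paper's actual rule or values.

```python
def categorize(stability, discriminability, t_s=0.7, t_d=0.7):
    """Assign one of four categories from a stability and a discriminability score.

    The thresholds t_s and t_d are hypothetical free parameters.
    """
    stable = stability >= t_s
    discriminative = discriminability >= t_d
    if stable and discriminative:
        return "Core Feature"
    if discriminative:
        return "Unstable High-Discrim"
    if stable:
        return "Usable Feature"
    return "Unusable Feature"


def select_core(scores, t_s=0.7, t_d=0.7):
    """scores maps feature name -> (stability, discriminability)."""
    return [
        name
        for name, (s, d) in scores.items()
        if categorize(s, d, t_s, t_d) == "Core Feature"
    ]
```

Moving either threshold changes which features count as core, which is why the ledger treats these cutoffs as a fitted parameter rather than a fixed property of the method.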

axioms (1)

domain assumption: Physical descriptors such as Laplacian variance and residual noise variance have consistent discriminative power independent of specific generative architectures.
Invoked when claiming the five features work across all tested GAN and diffusion datasets and will generalize further.

pith-pipeline@v0.9.0 · 5575 in / 1539 out tokens · 64174 ms · 2026-05-10T18:48:21.249959+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose a novel feature selection algorithm that identifies five core physical features including Laplacian variance, Sobel statistics, and residual noise variance"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "These features are then converted into text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

[1] Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF … (2022)
[2] Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? understanding properties that generalize. In: European conference on computer vision. pp. 103–120. Springer (2020)
[3] Chen, L., Zhang, Y., Song, Y., Liu, L., Wang, J.: Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18710–18719 (2022)
[4] Cheng, S., Lyu, L., Wang, Z., Zhang, X., Sehwag, V.: Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13455–13465 (2025)
[5] Dzanic, T., Shah, K., Witherden, F.: Fourier spectrum discrepancies in deep network generated images. Advances in neural information processing systems 33, 3022–3032 (2020)
[6] Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: International conference on machine learning. pp. 3247–3258. PMLR (2020)
[7] Haliassos, A., Vougioukas, K., Petridis, S., Pantic, M.: Lips don’t lie: A generalisable and robust approach to face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5039–5049 (2021)
[8] He, Y., Yu, N., Keuper, M., Fritz, M.: Beyond the spectrum: Detecting deepfakes via re-synthesis. arXiv preprint arXiv:2105.14376 (2021)
[9] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[10] Jeong, Y., Kim, D., Min, S., Joe, S., Gwon, Y., Choi, J.: Bihpf: Bilateral high-pass filters for robust deepfake detection. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 48–57 (2022)
[11] Ji, Y., Hong, Y., Zhan, J., Chen, H., Lan, J., Zhu, H., Wang, W., Zhang, L., Zhang, J.: Towards explainable fake image detection with multi-modal large language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 4398–4407 (2025)
[12] Kuckreja, K., Gupta, P., Khan, M.H., Dhall, A.: Pixels don’t lie (but your detector might): Bootstrapping mllm-as-a-judge for trustworthy deepfake detection and reasoning supervision. arXiv preprint arXiv:2602.19715 (2026)
[13] Li, Y., Liu, X., Wang, X., Lee, B.S., Wang, S., Rocha, A., Lin, W.: Fakebench: Probing explainable fake image detection via large multimodal models. IEEE Transactions on Information Forensics and Security (2025)
[14] Lin, L., Gupta, N., Zhang, Y., Ren, H., Liu, C.H., Ding, F., Wang, X., Li, X., Verdoliva, L., Hu, S.: Detecting multimedia generated by large ai models: A survey. arXiv preprint arXiv:2402.00045 (2024)
[15] Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive transformer for generalizable synthetic image detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10770–10780 (2024)
[16] Liu, J., Xie, J., Wang, Y., Zha, Z.J.: Adaptive texture and spectrum clue mining for generalizable face forgery detection. IEEE Transactions on Information Forensics and Security 19, 1922–1934 (2023)
[17] Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8060–8069 (2020)
[18] Mokady, R., Hertz, A., Bermano, A.H.: Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 (2021)
[19] Nataraj, L., Mohammed, T.M., Chandrasekaran, S., Flenner, A., Bappy, J.H., Roy-Chowdhury, A.K., Manjunath, B.: Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836 (2019)
[20] Nguyen-Le, H.H., Tran, V.T., Nguyen, D.T., Le-Khac, N.A.: Deepfake detection across image, video, and audio: A comprehensive survey with empirical evaluation of generalization and robustness (2025)
[21] Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize across generative models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24480–24489 (2023)
[22] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[23] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1–11 (2019)
[24] Sætra, H.S.: Generative ai: Here to stay, but for good? Technology in Society 75, 102372 (2023)
[25] Shao, R., Wu, T., Liu, Z.: Detecting and grounding multi-modal media manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6904–6913 (2023)
[26] Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18720–18729 (2022)
[27] Tan, C., Tao, R., Liu, H., Gu, G., Wu, B., Zhao, Y., Wei, Y.: C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 7184–7192 (2025)
[28] Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 28130–28139 (2024)
[29] Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: Generalized artifacts representation for gan-generated images detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12105–12114 (2023)
[30] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
[31] Theis, L.: What makes an image realistic? arXiv preprint arXiv:2403.04493 (2024)
[32] Wang, C., Deng, W.: Representative forgery mining for fake face detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14923–14932 (2021)
[33] Yan, Z., Luo, Y., Lyu, S., Liu, Q., Wu, B.: Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8984–8994 (2024)
[34] Yan, Z., Zhang, Y., Fan, Y., Wu, B.: Ucf: Uncovering common features for generalizable deepfake detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 22412–22423 (2023)
[35] Zhang, S., Lian, Z., Yang, J., Li, D., Pang, G., Liu, F., Han, B., Li, S., Tan, M.: Physics-driven spatiotemporal modeling for ai-generated video detection. arXiv preprint arXiv:2510.08073 (2025)
[36] Zhou, C., Wang, J., Li, Y., Li, L., Cao, J., Tang, S.: Beyond semantic features: Pixel-level mapping for generalized ai-generated image detection. arXiv preprint arXiv:2512.17350 (2025)
[37] Zhu, M., Chen, H., Yan, Q., Huang, X., Lin, G., Li, W., Tu, Z., Hu, H., Hu, J., Wang, Y.: Genimage: A million-scale benchmark for detecting ai-generated image. Advances in neural information processing systems 36, 77771–77782 (2023)

[1] [1]

In: Proceedings of the IEEE/CVF Title Suppressed Due to Excessive Length 13 Fig

Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction- classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Title Suppressed Due to Excessive Length 13 Fig. 9: Image-text cosine similarity distributions of SDv1.5 Fig. 10: Image-text cosine similarity distributions of VQDM Fig. 11: Image-text cosin...

work page 2022

[2] [2]

In: European conference on computer vision

Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? under- standing properties that generalize. In: European conference on computer vision. pp. 103–120. Springer (2020)

work page 2020

[3] [3]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, L., Zhang, Y., Song, Y., Liu, L., Wang, J.: Self-supervised learning of adver- sarial example: Towards good generalizations for deepfake detection. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18710–18719 (2022) 14 Mei Qiu, Jianqiang Zhao, and Yanyun Qu Fig. 12: Image-text cosine similarity distributio...

work page 2022

[4]

Cheng, S., Lyu, L., Wang, Z., Zhang, X., Sehwag, V.: Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13455–13465 (2025)

[5]

Dzanic, T., Shah, K., Witherden, F.: Fourier spectrum discrepancies in deep network generated images. Advances in neural information processing systems 33, 3022–3032 (2020)

[6]

Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: International conference on machine learning. pp. 3247–3258. PMLR (2020)

[7]

Haliassos, A., Vougioukas, K., Petridis, S., Pantic, M.: Lips don’t lie: A generalisable and robust approach to face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5039–5049 (2021)

[8]

He, Y., Yu, N., Keuper, M., Fritz, M.: Beyond the spectrum: Detecting deepfakes via re-synthesis. arXiv preprint arXiv:2105.14376 (2021)

[9]

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

[10]

Jeong, Y., Kim, D., Min, S., Joe, S., Gwon, Y., Choi, J.: Bihpf: Bilateral high-pass filters for robust deepfake detection. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 48–57 (2022)

[11]

Ji, Y., Hong, Y., Zhan, J., Chen, H., Lan, J., Zhu, H., Wang, W., Zhang, L., Zhang, J.: Towards explainable fake image detection with multi-modal large language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 4398–4407 (2025)

[12]

Kuckreja, K., Gupta, P., Khan, M.H., Dhall, A.: Pixels don’t lie (but your detector might): Bootstrapping mllm-as-a-judge for trustworthy deepfake detection and reasoning supervision. arXiv preprint arXiv:2602.19715 (2026)

[13]

Li, Y., Liu, X., Wang, X., Lee, B.S., Wang, S., Rocha, A., Lin, W.: Fakebench: Probing explainable fake image detection via large multimodal models. IEEE Transactions on Information Forensics and Security (2025)

[14]

Lin, L., Gupta, N., Zhang, Y., Ren, H., Liu, C.H., Ding, F., Wang, X., Li, X., Verdoliva, L., Hu, S.: Detecting multimedia generated by large ai models: A survey. arXiv preprint arXiv:2402.00045 (2024)

[15]

Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive transformer for generalizable synthetic image detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10770–10780 (2024)

[16]

Liu, J., Xie, J., Wang, Y., Zha, Z.J.: Adaptive texture and spectrum clue mining for generalizable face forgery detection. IEEE Transactions on Information Forensics and Security 19, 1922–1934 (2023)

[17]

Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8060–8069 (2020)

[18]

Mokady, R., Hertz, A., Bermano, A.H.: Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 (2021)

[19]

Nataraj, L., Mohammed, T.M., Chandrasekaran, S., Flenner, A., Bappy, J.H., Roy-Chowdhury, A.K., Manjunath, B.: Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836 (2019)

[20]

Nguyen-Le, H.H., Tran, V.T., Nguyen, D.T., Le-Khac, N.A.: Deepfake detection across image, video, and audio: A comprehensive survey with empirical evaluation of generalization and robustness (2025)

[21]

Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize across generative models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24480–24489 (2023)

[22]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

[23]

Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1–11 (2019)

[24]

Sætra, H.S.: Generative ai: Here to stay, but for good? Technology in Society 75, 102372 (2023)

[25]

Shao, R., Wu, T., Liu, Z.: Detecting and grounding multi-modal media manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6904–6913 (2023)

[26]

Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18720–18729 (2022)

[27]

Tan, C., Tao, R., Liu, H., Gu, G., Wu, B., Zhao, Y., Wei, Y.: C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 7184–7192 (2025)

[28]

Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 28130–28139 (2024)

[29]

Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: Generalized artifacts representation for gan-generated images detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12105–12114 (2023)

[30]

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

[31]

Theis, L.: What makes an image realistic? arXiv preprint arXiv:2403.04493 (2024)

[32]

Wang, C., Deng, W.: Representative forgery mining for fake face detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14923–14932 (2021)

[33]

Yan, Z., Luo, Y., Lyu, S., Liu, Q., Wu, B.: Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8984–8994 (2024)

[34]

Yan, Z., Zhang, Y., Fan, Y., Wu, B.: Ucf: Uncovering common features for generalizable deepfake detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 22412–22423 (2023)

[35]

Zhang, S., Lian, Z., Yang, J., Li, D., Pang, G., Liu, F., Han, B., Li, S., Tan, M.: Physics-driven spatiotemporal modeling for ai-generated video detection. arXiv preprint arXiv:2510.08073 (2025)

[36]

Zhou, C., Wang, J., Li, Y., Li, L., Cao, J., Tang, S.: Beyond semantic features: Pixel-level mapping for generalized ai-generated image detection. arXiv preprint arXiv:2512.17350 (2025)

[37]

Zhu, M., Chen, H., Yan, Q., Huang, X., Lin, G., Li, W., Tu, Z., Hu, H., Hu, J., Wang, Y.: Genimage: A million-scale benchmark for detecting ai-generated image. Advances in neural information processing systems 36, 77771–77782 (2023)