arxiv: 2604.04608 · v1 · submitted 2026-04-06 · 💻 cs.CV

Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detection · physical image features · synthetic detection · CLIP multimodal learning · feature selection · deepfake detection · universal descriptors · vision-language models

The pith

Five physical image features can distinguish AI-generated fakes from real images across diverse generative models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing deepfake detectors often overfit to particular generative models and lose effectiveness on new ones. This paper asks whether objective pixel-level physical properties can supply more stable discrimination between natural and synthetic images. The authors test 15 candidate features on more than 20 datasets spanning GANs and diffusion models, then use a feature-selection procedure to isolate five that remain consistently informative. The values of these five descriptors are written out as text and supplied to CLIP alongside semantic captions, so that the model learns image-text alignments grounded in physical authenticity. The resulting detector reaches state-of-the-art accuracy on GenImage benchmarks, including 99.8 percent on Wukong and SDv1.4.

Core claim

The paper shows that a compact set of five physical features (Laplacian variance, Sobel statistics, residual noise variance, and two additional descriptors chosen by the selection algorithm) maintains strong discriminative power across every tested dataset and architecture. When these features are encoded as text and combined with semantic captions to steer CLIP's representation learning, the multimodal model achieves near-perfect detection accuracy while reducing dependence on purely language-based cues.

What carries the argument

The novel feature-selection algorithm that extracts five stable physical descriptors (Laplacian variance, Sobel statistics, residual noise variance and two others) and converts them into text-encoded values for guiding CLIP’s image-text representation learning.
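
To make the named descriptors concrete, here is a minimal NumPy sketch of how such pixel-level statistics might be computed on a grayscale image. The kernels are the standard 3x3 Laplacian and Sobel operators; the residual-noise definition (image minus a 3x3 box blur) and the function names `conv3x3` and `physical_features` are assumptions for illustration, since this page does not give the paper's exact definitions.

```python
import numpy as np

def conv3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation of a 2-D float array."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T
BOX = np.full((3, 3), 1.0 / 9.0)  # assumed denoiser for the residual

def physical_features(gray):
    """Return illustrative physical descriptors for a 2-D grayscale image."""
    gray = np.asarray(gray, dtype=np.float64)
    lap = conv3x3(gray, LAPLACIAN)
    grad_mag = np.hypot(conv3x3(gray, SOBEL_X), conv3x3(gray, SOBEL_Y))
    # Assumption: residual noise = image minus its local 3x3 mean.
    residual = gray[1:-1, 1:-1] - conv3x3(gray, BOX)
    return {
        "laplacian_var": float(lap.var()),
        "sobel_mean": float(grad_mag.mean()),
        "sobel_std": float(grad_mag.std()),
        "residual_noise_var": float(residual.var()),
    }
```

On a perfectly flat image all four values are zero, and on a horizontal ramp with unit slope the Sobel magnitude is 8 everywhere, which gives a quick sanity check before running the function on real images.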

If this is right

  • The detector attains state-of-the-art performance on multiple GenImage benchmarks.
  • Accuracy reaches 99.8 percent on the Wukong and SDv1.4 datasets.
  • Physical features reduce overfitting to any single generative family.
  • Pixel-level authenticity signals improve the reliability of vision-language models.
  • The approach suggests a route toward mitigating hallucinations in multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the five features prove universal, a detector built on them could be applied to each new generator without retraining, because the features are computed directly from raw pixels.
  • Similar physical descriptors might be developed for video or audio synthesis detection.
  • Encoding physical measurements as text could be tested in other vision-language architectures to measure gains in factual grounding.

Load-bearing premise

The five selected physical features will remain stable and discriminative for generative models and architectures beyond the more than twenty datasets examined in the study.

What would settle it

Running the detector on images produced by a new generative architecture released after the experiments and observing whether accuracy falls well below the reported 99 percent level on the original benchmarks.

Figures

Figures reproduced from arXiv: 2604.04608 by Jianqiang Zhao, Mei Qiu, Yanyun Qu.

Figure 1: Overall workflow of the proposed synthetic image detection framework. (a) Enhanced caption preparation: integrate input images' core physical features into captions to enrich representation. (b) Training: use enhanced captions and class prompts to train the model for real/AI-generated image discrimination. (c) Testing: input images into the trained model to predict AI generation. The training and testing …

Figure 2: Stability (Ss, left) and discriminability (Sd, right) scores of image features for fake image detection. Based on the thresholds, features are assigned to Core Feature (green), Usable Feature (orange), Unstable High-Discrim (red), and Unusable Feature (grey).

Figure 3: Density distributions of four core features (Laplacian variance, Sobel magnitude mean/std, LBP variance) over ADM/Midjourney (GenImage) and BigGAN/CRN/Glide (UniversalFakeDetect). Blue = real images, red = fake images.
Surrounding text: "… UniversalFakeDetect subset. AUC is computed per feature using Logistic Regression. For detection, we adopt the CLIP ViT-L/14 with LoRA (following C2pClip [27]), replacing only the captions …"

Figure 4: Image-text pairs with original texts generated by ClipCap and enhanced by core features' description.
Surrounding text: "… patterns are robust to dataset variations, validating the utility of these core features for downstream fake image detection."

Figure 5: Quantitative evaluation of image-text similarity based on the pre-trained CLIP model.
Surrounding text: "… Third, the interplay between the class prompt and the feature-based textual descriptions has not been systematically explored; understanding this interaction is essential to disentangle their respective contributions and avoid confounding effects. Finally, while our method demonstrates strong performance on the …"

Figure 6: Cosine similarity comparison between two caption styles across 7 datasets. Blue/orange: results using caption "The major features are:". Green/red: results using caption "The physical features are:".

Figure 7: Image-text cosine similarity distributions of ADM

Figure 8: Image-text cosine similarity distributions of BigGAN

Figure 9: Image-text cosine similarity distributions of SDv1.5

Figure 10: Image-text cosine similarity distributions of VQDM

Figure 11: Image-text cosine similarity distributions of Wukong

Figure 12: Image-text cosine similarity distributions of Midjourney

Figure 13: Image-text cosine similarity distributions of Glide
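
Several of the figures above report image-text cosine similarity. As a hedged illustration of that quantity only, the sketch below computes row-wise cosine similarity between paired embedding vectors. The function name `cosine_similarity` and the toy vectors in the usage note are assumptions; real inputs would come from CLIP's image and text encoders, which are not called here.

```python
import numpy as np

def cosine_similarity(image_emb, text_emb):
    """Row-wise cosine similarity between paired embedding arrays.

    Both inputs have shape (n, d); row i of each array forms one image-text pair.
    """
    a = np.asarray(image_emb, dtype=np.float64)
    b = np.asarray(text_emb, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)
```

For example, `cosine_similarity([[1, 0], [0, 2]], [[3, 0], [1, 0]])` gives 1 for the first pair (parallel vectors) and 0 for the second (orthogonal vectors), independent of vector length.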
Original abstract

The rapid advancement of AI generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features including Laplacian variance, Sobel statistics, and residual noise variance that exhibit consistent discriminative power across all tested datasets. These features are then converted into text encoded values and integrated with semantic captions to guide image text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a comprehensive analysis of 15 physical features across more than 20 datasets from GANs and diffusion models yields a novel feature selection algorithm identifying five core descriptors (Laplacian variance, Sobel statistics, residual noise variance, and two unspecified others) with consistent discriminative power between real and synthetic images. These features are quantized into text tokens and fused with semantic captions inside a CLIP-based multimodal model, producing state-of-the-art detection results including 99.8% accuracy on benchmarks such as Wukong and SDv1.4.

Significance. If the selected physical descriptors prove architecture-independent and the multimodal integration preserves their signal without introducing new biases, the work could meaningfully advance robust synthetic-image detection by grounding it in low-level image physics rather than model-specific artifacts. The breadth of the multi-dataset evaluation across current generative families supplies partial empirical support for cross-architecture consistency and constitutes a clear strength.

major comments (2)
  1. [§3.2] §3.2 (feature selection procedure): The novel feature selection algorithm is applied to the identical collection of more than 20 datasets later used for final evaluation. This creates a circularity risk in which the five retained features may be tuned to statistical differences observed in these specific collections rather than independently validated universal properties. The manuscript does not describe hold-out sets, cross-validation folds, or an independent validation cohort for the selection step itself.
  2. [§4] §4 (experimental results): Reported accuracies (e.g., 99.8% on Wukong and SDv1.4) are presented without accompanying baseline comparisons against standard CLIP, other physical-feature detectors, or ablations that isolate the contribution of the five physical descriptors versus the semantic branch. No error bars, statistical significance tests, or dataset-bias diagnostics are supplied, rendering the claim of consistent superiority difficult to evaluate.
minor comments (2)
  1. [Abstract] The abstract refers to 'five core physical features including Laplacian variance, Sobel statistics, and residual noise variance' but does not enumerate the remaining two; the main text should supply their exact definitions and the selection criteria (thresholds, ranking metric) at the first mention.
  2. [§3.3] The conversion of continuous physical descriptors into discrete text tokens for CLIP is described only at a high level; a short appendix or subsection clarifying the quantization scheme and any information loss would improve reproducibility.
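
The second minor comment concerns how continuous descriptors become caption text. One simple possibility, offered purely as an assumption, is fixed-precision rounding appended to the semantic caption. The prefix "The physical features are:" is taken from the Figure 6 caption; the function name `enhance_caption`, the rounding to two decimals, and the comma-separated layout are hypothetical and are not confirmed by the paper.

```python
def enhance_caption(caption, features,
                    prefix="The physical features are:", decimals=2):
    """Append rounded feature values to a semantic caption.

    Assumption: values are rounded to a fixed number of decimals; the
    paper's actual quantization scheme is not described on this page.
    """
    parts = [
        f"{name.replace('_', ' ')} {value:.{decimals}f}"
        for name, value in features.items()
    ]
    return f"{caption.rstrip('. ')}. {prefix} {', '.join(parts)}."
```

Rounding is where information loss enters: with two decimals, any two images whose feature values differ by less than about 0.005 can receive identical text, which is the reproducibility detail the referee asks the authors to document.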

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, acknowledging valid concerns and outlining specific revisions to improve the manuscript's rigor.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (feature selection procedure): The novel feature selection algorithm is applied to the identical collection of more than 20 datasets later used for final evaluation. This creates a circularity risk in which the five retained features may be tuned to statistical differences observed in these specific collections rather than independently validated universal properties. The manuscript does not describe hold-out sets, cross-validation folds, or an independent validation cohort for the selection step itself.

    Authors: We acknowledge this concern about potential circularity in feature selection. Although the five features (Laplacian variance, Sobel statistics, residual noise variance, and the two additional descriptors detailed in §3.2) demonstrate consistent discriminative power across more than 20 datasets spanning multiple GAN and diffusion architectures, this does not fully eliminate the risk of dataset-specific tuning. In the revised manuscript, we will update §3.2 to describe a hold-out validation procedure: feature selection will be performed on 70% of the dataset collections, with the retained features then validated for consistency on the remaining 30% hold-out sets. Cross-validation folds and independent cohort metrics will be reported to confirm universality. revision: yes

  2. Referee: [§4] §4 (experimental results): Reported accuracies (e.g., 99.8% on Wukong and SDv1.4) are presented without accompanying baseline comparisons against standard CLIP, other physical-feature detectors, or ablations that isolate the contribution of the five physical descriptors versus the semantic branch. No error bars, statistical significance tests, or dataset-bias diagnostics are supplied, rendering the claim of consistent superiority difficult to evaluate.

    Authors: We agree that the experimental section requires additional controls and statistical support to substantiate the performance claims. The revised §4 will incorporate: (i) direct comparisons to standard CLIP and other physical-feature detectors, (ii) ablation studies separating the physical descriptor branch from the semantic captions, (iii) error bars computed over multiple random seeds, (iv) statistical significance tests (e.g., McNemar's test or paired t-tests), and (v) dataset-bias diagnostics including per-architecture breakdowns. These additions will allow clearer evaluation of the multimodal fusion's contribution. revision: yes
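
The rebuttal names McNemar's test as one candidate significance test. As a rough sketch of what that would involve, the exact two-sided version needs only the two discordant counts from a paired comparison of detectors on the same test images. The function name `mcnemar_exact` and the argument names `b` and `c` are illustrative choices, not taken from the paper.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: images detector A classifies correctly and detector B incorrectly
    c: images detector B classifies correctly and detector A incorrectly
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

With 10 discordant images all favouring one detector, `mcnemar_exact(0, 10)` returns 2/1024, roughly 0.002. At accuracies near 99.8 percent the number of discordant pairs can be very small, so the test may have little power unless the test sets are large.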

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper conducts an empirical survey of 15 physical features over >20 datasets from existing GANs and diffusion models, applies a novel selection procedure to retain five that show consistent separation on those same datasets, encodes the selected values as text, and fuses them into CLIP for downstream detection. No equation, theorem, or load-bearing claim is shown to be definitionally equivalent to its own inputs; the selection step is an explicit algorithmic contribution whose output is then validated by reported accuracy numbers rather than presupposed. The generalization claim to future generators is an empirical hypothesis, not a self-referential derivation. Self-citations are not invoked to justify uniqueness or to close the argument.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical assumption that a small set of pixel-level physical measures can be selected to generalize across generative models, with the selection process itself likely depending on data-driven choices.

free parameters (1)
  • Feature selection criteria and thresholds
    The novel algorithm that narrows 15 features to five core ones requires choices of scoring or cutoff values that are fitted or tuned on the tested datasets.
axioms (1)
  • domain assumption: Physical descriptors such as Laplacian variance and residual noise variance have consistent discriminative power independent of specific generative architectures.
    Invoked when claiming the five features work across all tested GAN and diffusion datasets and will generalize further.
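
To show where the free parameters enter, the following sketch scores one feature across datasets and assigns it to one of the four categories named in the Figure 2 caption. Everything numeric here is an assumption: the score definitions (Sd as the mean distance of per-dataset AUC from chance, Ss as one minus twice the AUC standard deviation), the thresholds, and the order of the category rules are illustrative, not the paper's algorithm. The per-feature AUC is rank-based; for a single feature this equals the AUC of any score that increases monotonically with the feature. The `split_datasets` helper mirrors the 70/30 hold-out over dataset collections proposed in the rebuttal.

```python
import numpy as np

def auc_1d(real, fake):
    """Rank-based AUC: probability that a fake value exceeds a real one."""
    real = np.asarray(real, dtype=np.float64)
    fake = np.asarray(fake, dtype=np.float64)
    diff = fake[:, None] - real[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def score_feature(per_dataset):
    """per_dataset: list of (real_values, fake_values), one pair per dataset."""
    aucs = np.array([auc_1d(r, f) for r, f in per_dataset])
    sd = float(2.0 * np.mean(np.abs(aucs - 0.5)))  # assumed discriminability, in [0, 1]
    ss = float(1.0 - 2.0 * np.std(aucs))           # assumed stability, in [0, 1]
    return ss, sd

def categorize(ss, sd, ss_thr=0.8, sd_thr=0.6, sd_usable=0.3):
    """Map scores to the Figure 2 category names (thresholds are hypothetical)."""
    if sd >= sd_thr and ss >= ss_thr:
        return "Core Feature"
    if sd >= sd_thr:
        return "Unstable High-Discrim"
    if sd >= sd_usable and ss >= ss_thr:
        return "Usable Feature"
    return "Unusable Feature"

def split_datasets(names, train_frac=0.7, seed=0):
    """Split dataset names into selection and hold-out groups."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(names))
    cut = int(round(train_frac * len(names)))
    return [names[i] for i in order[:cut]], [names[i] for i in order[cut:]]
```

A feature whose separation flips direction between datasets scores high on Sd but low on Ss under these definitions, so it lands in "Unstable High-Discrim". Running `score_feature` on the selection group and again on the hold-out group would show whether a feature's category survives on unseen collections.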


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings:
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

  1. [1] Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF ...
  2. [2] Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? understanding properties that generalize. In: European conference on computer vision. pp. 103–120. Springer (2020)
  3. [3] Chen, L., Zhang, Y., Song, Y., Liu, L., Wang, J.: Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18710–18719 (2022)
  4. [4] Cheng, S., Lyu, L., Wang, Z., Zhang, X., Sehwag, V.: Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13455–13465 (2025)
  5. [5] Dzanic, T., Shah, K., Witherden, F.: Fourier spectrum discrepancies in deep network generated images. Advances in neural information processing systems 33, 3022–3032 (2020)
  6. [6] Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: International conference on machine learning. pp. 3247–3258. PMLR (2020)
  7. [7] Haliassos, A., Vougioukas, K., Petridis, S., Pantic, M.: Lips don't lie: A generalisable and robust approach to face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5039–5049 (2021)
  8. [8] He, Y., Yu, N., Keuper, M., Fritz, M.: Beyond the spectrum: Detecting deepfakes via re-synthesis. arXiv preprint arXiv:2105.14376 (2021)
  9. [9] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  10. [10] Jeong, Y., Kim, D., Min, S., Joe, S., Gwon, Y., Choi, J.: Bihpf: Bilateral high-pass filters for robust deepfake detection. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 48–57 (2022)
  11. [11] Ji, Y., Hong, Y., Zhan, J., Chen, H., Lan, J., Zhu, H., Wang, W., Zhang, L., Zhang, J.: Towards explainable fake image detection with multi-modal large language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 4398–4407 (2025)
  12. [12] Kuckreja, K., Gupta, P., Khan, M.H., Dhall, A.: Pixels don't lie (but your detector might): Bootstrapping mllm-as-a-judge for trustworthy deepfake detection and reasoning supervision. arXiv preprint arXiv:2602.19715 (2026)
  13. [13] Li, Y., Liu, X., Wang, X., Lee, B.S., Wang, S., Rocha, A., Lin, W.: Fakebench: Probing explainable fake image detection via large multimodal models. IEEE Transactions on Information Forensics and Security (2025)
  14. [14] Lin, L., Gupta, N., Zhang, Y., Ren, H., Liu, C.H., Ding, F., Wang, X., Li, X., Verdoliva, L., Hu, S.: Detecting multimedia generated by large ai models: A survey. arXiv preprint arXiv:2402.00045 (2024)
  15. [15] Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive transformer for generalizable synthetic image detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10770–10780 (2024)
  16. [16] Liu, J., Xie, J., Wang, Y., Zha, Z.J.: Adaptive texture and spectrum clue mining for generalizable face forgery detection. IEEE Transactions on Information Forensics and Security 19, 1922–1934 (2023)
  17. [17] Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8060–8069 (2020)
  18. [18] Mokady, R., Hertz, A., Bermano, A.H.: Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 (2021)
  19. [19] Nataraj, L., Mohammed, T.M., Chandrasekaran, S., Flenner, A., Bappy, J.H., Roy-Chowdhury, A.K., Manjunath, B.: Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836 (2019)
  20. [20] Nguyen-Le, H.H., Tran, V.T., Nguyen, D.T., Le-Khac, N.A.: Deepfake detection across image, video, and audio: A comprehensive survey with empirical evaluation of generalization and robustness (2025)
  21. [21] Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize across generative models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24480–24489 (2023)
  22. [22] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  23. [23] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1–11 (2019)
  24. [24] Sætra, H.S.: Generative ai: Here to stay, but for good? Technology in Society 75, 102372 (2023)
  25. [25] Shao, R., Wu, T., Liu, Z.: Detecting and grounding multi-modal media manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6904–6913 (2023)
  26. [26] Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18720–18729 (2022)
  27. [27] Tan, C., Tao, R., Liu, H., Gu, G., Wu, B., Zhao, Y., Wei, Y.: C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 7184–7192 (2025)
  28. [28] Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 28130–28139 (2024)
  29. [29] Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: Generalized artifacts representation for gan-generated images detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12105–12114 (2023)
  30. [30] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
  31. [31] Theis, L.: What makes an image realistic? arXiv preprint arXiv:2403.04493 (2024)
  32. [32] Wang, C., Deng, W.: Representative forgery mining for fake face detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14923–14932 (2021)
  33. [33] Yan, Z., Luo, Y., Lyu, S., Liu, Q., Wu, B.: Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8984–8994 (2024)
  34. [34] Yan, Z., Zhang, Y., Fan, Y., Wu, B.: Ucf: Uncovering common features for generalizable deepfake detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 22412–22423 (2023)
  35. [35] Zhang, S., Lian, Z., Yang, J., Li, D., Pang, G., Liu, F., Han, B., Li, S., Tan, M.: Physics-driven spatiotemporal modeling for ai-generated video detection. arXiv preprint arXiv:2510.08073 (2025)
  36. [36] Zhou, C., Wang, J., Li, Y., Li, L., Cao, J., Tang, S.: Beyond semantic features: Pixel-level mapping for generalized ai-generated image detection. arXiv preprint arXiv:2512.17350 (2025)
  37. [37] Zhu, M., Chen, H., Yan, Q., Huang, X., Lin, G., Li, W., Tu, Z., Hu, H., Hu, J., Wang, Y.: Genimage: A million-scale benchmark for detecting ai-generated image. Advances in neural information processing systems 36, 77771–77782 (2023)