arxiv: 2605.31192 · v1 · pith:DEXFSJCGnew · submitted 2026-05-29 · 💻 cs.CV

The Regularizing Power of Language-Training Deepfake Detectors

Pith reviewed 2026-06-28 23:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords deepfake detection · multimodal LLM · regularization · cross-dataset generalization · interpretability · reinforcement learning · dual-encoder architecture

The pith

Language training regularizes deepfake detectors by steering them toward high-level generalizable features rather than low-level dataset artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that deepfake detectors typically overfit to low-level domain-specific cues that fail to transfer across datasets. It proposes that an LLM pretrained on language will naturally favor high-level, describable artifacts that generalize better, and shows this can be exploited as a regularization mechanism. A dual-encoder architecture pairs a frozen specialist detector with a LoRA-tuned multimodal LLM encoder. Training proceeds in two stages: first binary alignment to combine features, then reinforcement learning that rewards the model for generating descriptive reasoning before classifying, using only binary labels. This produces both interpretable outputs and measurable gains in cross-dataset accuracy, even when the reasoning step is dropped at test time.

Core claim

The paper establishes that a dual-encoder architecture combining a frozen specialist detector with a LoRA-tuned MLLM encoder, trained first through binary alignment and then through reinforcement learning that incentivizes explain-then-classify behavior, enables the model to prioritize high-level robust features. This yields improved cross-dataset generalization and produces descriptive reasoning chains, with the performance benefit persisting even when those chains are omitted during inference.
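
A minimal sketch of how the binary alignment stage could read a real/fake decision directly from the language model's next-token distribution, as the Figure 2 caption describes (one-word answer, no separate classification head). The token ids, the toy logits, and the choice to renormalise over only the two answer tokens are assumptions made for illustration; the paper's exact loss is not given in this summary.

```python
import math


def two_way_fake_probability(logits, real_id, fake_id):
    """Renormalise next-token logits over only the two answer tokens (assumed)."""
    m = max(logits[real_id], logits[fake_id])  # subtract max for stability
    e_real = math.exp(logits[real_id] - m)
    e_fake = math.exp(logits[fake_id] - m)
    return e_fake / (e_real + e_fake)


def binary_cross_entropy(p_fake, label, eps=1e-12):
    """Binary cross-entropy; label is 1 for fake and 0 for real."""
    p = min(max(p_fake, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy vocabulary of four tokens; ids 1 and 2 stand in for "real" and "fake".
logits = [0.1, 2.0, 0.5, -1.0]
p_fake = two_way_fake_probability(logits, real_id=1, fake_id=2)
loss = binary_cross_entropy(p_fake, label=1)
```

The same `p_fake` score can feed standard binary metrics, which is consistent with the caption's remark that the MLLM output is used for classification directly.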

What carries the argument

Dual-encoder architecture (frozen specialist detector paired with LoRA-tuned MLLM encoder) plus two-stage curriculum of binary alignment followed by RL for explain-then-classify.

If this is right

  • Cross-dataset performance exceeds prior state-of-the-art methods by a large margin.
  • The model produces human-readable reasoning chains before classifying.
  • Accuracy gains remain even when reasoning chains are removed at inference time.
  • The approach combines high-level language features where possible with low-level features only when necessary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curriculum could be tested on other binary classification tasks that suffer from domain-specific overfitting, such as medical image anomaly detection.
  • If the preference for describable features is the active ingredient, simpler prompting strategies without RL might achieve partial regularization gains.
  • The method suggests that interpretability and generalization can be pursued jointly rather than traded off.

Load-bearing premise

An LLM pretrained on language will intrinsically prefer high-level describable artifacts over low-level domain-specific ones, allowing the RL stage to successfully steer the model toward robust features.

What would settle it

Run the RL stage on multiple datasets and measure whether cross-dataset accuracy fails to improve beyond the binary-alignment baseline or whether the generated descriptions remain generic and non-predictive of the classification decision.
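
One hedged way to set up the first half of this check is to compare per-dataset accuracy of the binary-alignment checkpoint against the RL checkpoint on held-out datasets. The dataset names, predictions, and labels below are made-up placeholders, not results from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the binary labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def cross_dataset_gain(stage1_preds, stage2_preds, labels_by_dataset):
    """Accuracy of the RL stage minus the binary-alignment baseline, per dataset."""
    gains = {}
    for name, labels in labels_by_dataset.items():
        baseline = accuracy(stage1_preds[name], labels)
        after_rl = accuracy(stage2_preds[name], labels)
        gains[name] = after_rl - baseline
    return gains


# Hypothetical held-out datasets (1 = fake, 0 = real).
labels_by_dataset = {"held_out_a": [1, 0, 1, 0], "held_out_b": [1, 1, 0, 0]}
stage1_preds = {"held_out_a": [1, 0, 0, 0], "held_out_b": [1, 0, 0, 0]}
stage2_preds = {"held_out_a": [1, 0, 1, 0], "held_out_b": [1, 0, 0, 0]}
gains = cross_dataset_gain(stage1_preds, stage2_preds, labels_by_dataset)
```

Gains that stay near zero across held-out datasets would count against the claim that the RL stage adds regularization beyond binary alignment.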

Figures

Figures reproduced from arXiv: 2605.31192 by Benedikt Hopf, Radu Timofte, Zongwei Wu.

Figure 1
Figure 1: Motivation. Previous methods usually (a) do not provide language output or (b) learn post-hoc explanations from supervised data (human-annotated or hand-crafted features). Using supervised finetuning, reinforcement learning, and a dual-encoder design, our method jointly learns language and detection, leading not only to interpretable descriptions but also benefiting from the implicit regularization of l…
Figure 2
Figure 2: First stage: modality alignment. Tokens from deepfake detector, vision encoder, and text are passed to the model, asking for a one-word answer. The model outputs a probability distribution over all words, which we can supervise with binary labels and calculate binary metrics from. Note that, unlike previous work, we do not need a separate classification head and directly use the MLLM for classification,…
Figure 3
Figure 3: Second stage: Reinforcement learning. We provide the model with the requested output structure, a question regarding the authenticity of the candidate image, and the image itself. We then sample multiple answers, judge them using our reward functions, and train using GRPO [49]. Negative advantages are discouraged, positive ones encouraged. This strengthens the alignment between the modalities, as all compo…
Figure 4
Figure 4: Out-of-domain examples. The first image is the famous Will-Smith-eating-spaghetti example [62] by Google’s Veo 3 [10], the second one is taken from [75]. The center left is generated by Google Gemini 2.5 Flash [17], the center right and lower left by Gemini 3 Flash [46], and the final image is a still from [40]. We specifically include a failure case, showing that very high-quality images can avoid detecti…
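
The sampling-and-scoring loop in the Figure 3 caption can be sketched with the group-relative advantage used by GRPO [49]: each sampled answer's reward is standardised against the other answers drawn for the same image. The reward values are toy numbers, and the use of the population standard deviation with a small epsilon is an assumption rather than a detail confirmed by the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four answers sampled for one image; 1.0 marks a rewarded answer.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Answers scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. A group in which every answer earns the same reward yields all-zero advantages and therefore no learning signal.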
Original abstract

Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that deepfake detection overfitting to low-level domain-specific artifacts can be mitigated by leveraging MLLM language pretraining as a regularizer. It introduces a dual-encoder architecture (frozen specialist detector + LoRA-tuned MLLM) trained in two stages: (1) binary alignment to combine features and reduce dataset-specific overfitting, and (2) RL that rewards 'explain-then-classify' behavior using only binary labels. This is asserted to yield both interpretable descriptions and improved cross-dataset generalization, even when reasoning is omitted at inference, with extensive experiments showing large-margin outperformance over SOTA.

Significance. If the results hold, the work would demonstrate a practical mechanism for using language pretraining bias to favor generalizable high-level features in detection tasks, simultaneously advancing interpretability and robustness without extra supervision. The two-stage curriculum and dual-encoder design are concrete contributions that could influence future multimodal regularization approaches in CV.

major comments (2)
  1. [Abstract] Abstract: The central claim that the RL stage 'explicitly incentivize[s] the model to prioritize high-level, robust features' rests on rewarding only binary classification correctness plus format compliance. No term in the reward penalizes post-hoc or non-causal explanations, so the mechanism does not demonstrably force the model away from low-level cues that still produce correct binary labels.
  2. [Abstract] Abstract (method description): The assertion that 'an LLM that has been pretrained on language would prefer high-level artifacts that can be described better' is presented as an intrinsic bias that the RL curriculum exploits, yet no ablation or analysis is referenced showing that the generated explanations are faithful to the detector's actual decision features rather than fluent but decoupled text.
minor comments (1)
  1. [Title] The title contains an apparent hyphenation inconsistency ('Language-Training').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, proposing targeted revisions to the abstract and discussion sections where the concerns are valid.

  1. Referee: [Abstract] Abstract: The central claim that the RL stage 'explicitly incentivize[s] the model to prioritize high-level, robust features' rests on rewarding only binary classification correctness plus format compliance. No term in the reward penalizes post-hoc or non-causal explanations, so the mechanism does not demonstrably force the model away from low-level cues that still produce correct binary labels.

    Authors: We agree that the reward signal consists solely of binary correctness and format compliance and therefore does not contain an explicit penalty for post-hoc or non-causal reasoning. The claim that the RL stage 'explicitly incentivizes' prioritization of high-level features is therefore stronger than the evidence directly supports. We will revise the abstract to replace 'explicitly incentivize' with 'encourage via the explain-then-classify format' and will add a short paragraph in the discussion acknowledging that the regularization effect is indirect and could in principle be satisfied by low-level cues accompanied by fluent but non-causal text. revision: yes

  2. Referee: [Abstract] Abstract (method description): The assertion that 'an LLM that has been pretrained on language would prefer high-level artifacts that can be described better' is presented as an intrinsic bias that the RL curriculum exploits, yet no ablation or analysis is referenced showing that the generated explanations are faithful to the detector's actual decision features rather than fluent but decoupled text.

    Authors: The referee is correct that the manuscript offers no ablation or analysis (e.g., comparison with saliency maps or intervention experiments) demonstrating that the generated explanations are faithful to the features actually used by the dual-encoder rather than fluent but decoupled text. We will add this point explicitly to the limitations subsection and will outline possible verification approaches for future work, while retaining the original hypothesis as a motivating intuition rather than a proven mechanism. revision: yes
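
The exchange above turns on what the reward can and cannot see. A minimal sketch, assuming a hypothetical `<reasoning>...</reasoning><answer>...</answer>` output format and illustrative weights (neither is taken from the paper), shows that a reward built only from format compliance and binary correctness assigns the same score to a specific explanation and a generic one.

```python
import re

# Hypothetical output structure; the paper's actual template is not given here.
PATTERN = re.compile(
    r"\s*<reasoning>(.+?)</reasoning>\s*<answer>(real|fake)</answer>\s*",
    re.DOTALL,
)


def reward(completion, label, format_weight=0.5, correct_weight=1.0):
    """Format compliance plus binary correctness; the weights are illustrative."""
    match = PATTERN.fullmatch(completion)
    if match is None:
        return 0.0  # malformed output earns nothing
    total = format_weight
    if match.group(2) == label:
        total += correct_weight
    return total


specific = ("<reasoning>The jawline blends unnaturally into the neck."
            "</reasoning><answer>fake</answer>")
generic = ("<reasoning>The image looks suspicious."
           "</reasoning><answer>fake</answer>")
same_score = reward(specific, "fake") == reward(generic, "fake")
```

Because the reasoning text never enters the score, any pressure toward faithful, high-level explanations has to come from elsewhere, for example from the language prior that the paper hypothesises.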

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretraining and binary supervision

The paper presents a two-stage curriculum (binary alignment then RL for explain-then-classify) whose central mechanism is the use of an externally pretrained MLLM plus binary labels to incentivize high-level features. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The intuition that language pretraining biases toward describable artifacts is stated as motivation rather than derived from prior self-work, and the RL reward is explicitly binary, leaving the generalization claim as an empirical hypothesis rather than a definitional reduction. The method is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view provides no explicit free parameters, axioms, or invented entities; the method builds on standard components (LoRA, RL with binary labels) whose details are not supplied.

Reference graph

Works this paper leans on

75 extracted references · 32 canonical work pages · 11 internal anchors

  1. [1]

    Afchar, D., Nozick, V., Yamagishi, J., Echizen, I.: Mesonet: a compact facial video forgery detection network. 2018 IEEE International Workshop on Information Forensics and Security (WIFS) pp. 1–7 (2018), https://api.semanticscholar.org/CorpusID:521574751

  2. [2]

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923 9, 11, 4, 5

  3. [3]

    Detecting generated images by real images only,

    Bi, X., Liu, B., Yang, F., Xiao, B., Li, W., Huang, G., Cosman, P.C.: Detecting generated images by real images only. ArXiv abs/2311.00962 (2023), https://api.semanticscholar.org/CorpusID:2649353248, 11

  4. [4]

    Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4113–4122 (June 2022) 1, 3, 4

  5. [5]

    Antifakeprompt: Prompt-tuned vision-language models are fake image detectors,

    Chang, Y.M., Yeh, C., Chiu, W.C., Yu, N.: Antifakeprompt: Prompt-tuned vision-language models are fake image detectors. ArXiv abs/2310.17419 (2023), https://api.semanticscholar.org/CorpusID:2644904908, 11

  6. [6]

    Chen, H., Magramo, K.: Finance worker pays out $25 million after video call with deepfake “chief financial officer” (Feb 2024), https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-scam-hong-kong-intl-hnk1

  7. [7]

    Chollet, F.: Xception: Deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1800–1807 (2016), https://api.semanticscholar.org/CorpusID:23751108

  8. [8]

    Cui, X., Li, Y., Zhu, D., Zhou, J., Dong, J., Lyu, S.: Forensics adapter: Unleashing clip for generalizable face forgery detection (2025), https://arxiv.org/abs/2411.19715 3

  9. [9]

    Deepfakes: deepfakes_faceswap. https://github.com/deepfakes/faceswap (2019) 1

  10. [10]

    Deepmind, G.: Veo: a text-to-video generation system (2025) 13, 14, 2

  11. [11]

    DeepSeek-AI: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning (2025), https://arxiv.org/abs/2501.12948 3, 4, 6

  12. [12]

    Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: Qlora: Efficient finetuning of quantized llms (2023), https://arxiv.org/abs/2305.14314 6

  13. [13]

    The DeepFake Detection Challenge (DFDC) Dataset

    Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Canton-Ferrer, C.: The deepfake detection challenge dataset. ArXivabs/2006.07397(2020), https://api.semanticscholar.org/CorpusID:2196876161, 8, 10, 3

  14. [14]

    Dolhansky, B., Howes, R., Pflaum, B., Baram, N., Canton-Ferrer, C.: The deepfake detection challenge (dfdc) preview dataset. ArXiv abs/1910.08854 (2019), https://api.semanticscholar.org/CorpusID:2048009391, 8, 10, 3

  15. [15]

    Bootstrap methods: Another look at the jackknife,

    Efron, B.: Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7(1), 1–26 (1979). https://doi.org/10.1214/aos/1176344552, https://doi.org/10.1214/aos/1176344552 5

  16. [16]

    Favreau, J., Lucas, G.: The mandalorian (2020), https://www.disneyplus.com/en-de/browse/entity-422f6dcc-226f-44e7-98d4-22de69b31cf3?distributionPartner=google1, 2

  17. [17]

    Fortin, A., Vernade, G., Kampf, K., Reshi, A.: Introducing gemini 2.5 flash image, our state-of-the-art image model (2025), https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/13, 14, 1, 2

  18. [18]

    Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization (2023), https://arxiv.org/abs/2303.17111 4

  19. [19]

    Guo, X., Song, X., Zhang, Y., Liu, X., Liu, X.: Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector (2025), https://arxiv.org/abs/2503.20188 2, 3, 5, 8, 9, 10, 11, 4, 6

  20. [20]

    Guo, Y., Zhen, C., Yan, P.: Controllable guide-space for generalizable face forgery detection. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 20761–20770 (2023), https://api.semanticscholar.org/CorpusID:2601648913, 4

  21. [21]

    Hasanaath, A.A., Luqman, H., Katib, R., Anwar, S.: Fsbi: Deepfakes detection with frequency enhanced self-blended images. ArXiv abs/2406.08625 (2024), https://api.semanticscholar.org/CorpusID:2704405863

  22. [22]

    He, Y., Yu, N., Keuper, M., Fritz, M.: Beyond the spectrum: Detecting deepfakes via re-synthesis. ArXiv abs/2105.14376 (2021), https://api.semanticscholar.org/CorpusID:2352547663

  23. [23]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops

    Hopf, B., Timofte, R.: Practical manipulation model for robust deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. pp. 5675–5684 (October 2025) 1, 3, 4, 12

  24. [24]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, J.E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Chen, W.: Lora: Low-rank adaptation of large language models. ArXivabs/2106.09685(2021), https://api.semanticscholar.org/CorpusID:2354580096

  25. [25]

    Huang, T.M., Lin, W.T., Hua, K.L., Cheng, W.H., Yamagishi, J., Chen, J.C.: Thinkfake: Reasoning in multimodal large language models for ai-generated image detection (2025), https://arxiv.org/abs/2509.19841 3, 11

  26. [26]

    Huang, Z., Hu, J., Li, X., He, Y., Zhao, X., Peng, B., Wu, B., Huang, X., Cheng, G.: Sida: Social media image deepfake detection, localization and explanation with large multimodal model. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 28831–28841 (2025), https://api.semanticscholar.org/CorpusID:2745151453, 8, 10, 11, 13

  27. [27]

    Huang, Z., Li, T., Li, X., Wen, H., He, Y., Zhang, J., Fei, H., Yang, X., Huang, X., Peng, B., Cheng, G.: So-fake: Benchmarking and explaining social media image forgery detection (2025), https://arxiv.org/abs/2505.18660 3, 11

  28. [28]

    Jiang, C., Dong, W., Zhang, Z., Yu, F., Peng, W., Yuan, X., Bi, Y., Zhao, M., Zhou, Z., Si, C., Shan, C.: Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection (2026), https://arxiv.org/abs/2506.00979 3, 11

  29. [29]

    Jung, T., Kim, S., Kim, K.: Deepvision: Deepfakes detection using human eye blinking pattern. IEEE Access 8, 83144–83154 (2020), https://api.semanticscholar.org/CorpusID:2186518781

  30. [30]

    Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 8107–8116 (2019), https://api.semanticscholar.org/CorpusID:2092022731

  31. [31]

    Kim, T., Choi, J., Jeong, Y., Noh, H., Yoo, J., Baek, S., Choi, J.: Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection (2025), https://arxiv.org/abs/2507.02398 3

  32. [32]

    Kowalski, M.: Faceswap. https://github.com/MarekKowalski/FaceSwap (2018) 1, 12

  33. [33]

    Larue, N., Vu, N.S., Struc, V., Peer, P., Christophides, V.: Seeable: Soft discrepancies and bounded contrastive learning for exposing deepfakes. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 20954–20964 (2022), https://api.semanticscholar.org/CorpusID:2537344173, 4

  34. [34]

    Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B.: Face x-ray for more general face forgery detection. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 5000–5009 (2019), https://api.semanticscholar.org/CorpusID:2095164241, 3, 4, 12

  35. [35]

    Li, T., Huang, Z., Wen, H., He, Y., Lyu, S., Wu, B., Cheng, G.: Raidx: A retrieval-augmented generation and grpo reinforcement learning framework for explainable deepfake detection. Proceedings of the 33rd ACM International Conference on Multimedia (2025), https://api.semanticscholar.org/CorpusID:2805363793, 8, 9, 10, 11

  36. [36]

    Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-df: A large-scale challenging dataset for deepfake forensics. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 3204–3213 (2019), https://api.semanticscholar.org/CorpusID:2127264301, 8, 9, 10, 3

  37. [37]

    Liu, H., Li, X., Zhou, W., Chen, Y., He, Y., Xue, H., Zhang, W., Yu, N.: Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 772–781 (2021), https://api.semanticscholar.org/CorpusID:2320921673, 4

  38. [38]

    Luo, Y., Zhang, Y., Yan, J., Liu, W.: Generalizing face forgery detection with high-frequency features. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 16312–16321 (2021), https://api.semanticscholar.org/CorpusID:2323205993

  39. [39]

    Masi, I., Killekar, A., Mascarenhas, R.M., Gurudatt, S.P., AbdAlmageed, W.: Two-branch recurrent network for isolating deepfakes in videos. ArXiv abs/2008.03412 (2020), https://api.semanticscholar.org/CorpusID:2210906633

  40. [40]

    Nep, D.: This is not morgan freeman - a deepfake singularity (2021) 13, 14

  41. [41]

    Nguyen, D., Astrid, M., Kacem, A., Ghorbel, E., Aouada, D.: Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection (2025), https://arxiv.org/abs/2501.01184 3, 4

  42. [42]

    Nguyen, D., Mejri, N., Singh, I.P., Kuleshova, P., Astrid, M., Kacem, A., Ghorbel, E., Aouada, D.: Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17395–17405 (2024) 1, 3, 9, 4, 8, 12

  43. [43]

    OpenAI: Gpt-5.1: A smarter, more conversational chatgpt (Nov 2025), https://openai.com/index/gpt-5-1/1, 2

  44. [44]

    Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: Face forgery detection by mining frequency-aware clues. ArXiv abs/2007.09355 (2020), https://api.semanticscholar.org/CorpusID:2206474993

  45. [45]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021), https://arxiv.org/abs/2103.00020 8

  46. [46]

    Raisinghani, N.: Nano banana 2: Combining pro capabilities with lightning-fast speed (Feb 2026), https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/13, 14, 2

  47. [47]

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) 1

  48. [48]

    Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: FaceForensics++: Learning to detect manipulated facial images. In: International Conference on Computer Vision (ICCV) (2019) 1, 3, 8, 9, 10, 4, 12

  49. [49]

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models (2024),https://arxiv.org/abs/2402.03300 2, 3, 4, 7, 11

  50. [50]

    Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 18699–18708 (2022), https://api.semanticscholar.org/CorpusID:2482279161, 3, 10, 4, 8, 12

  51. [51]

    Sun, K., Chen, S., Yao, T., Sun, X., Ding, S., Ji, R.: Towards general visual-linguistic face forgery detection. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 19576–19586 (2023), https://api.semanticscholar.org/CorpusID:2603342192, 3, 5, 8, 9, 10, 11, 4, 6

  52. [52]

    Tan, H., Lan, J., Tan, Z., Liu, A., Song, C., Shi, S., Zhu, H., Wang, W., Wan, J., Lei, Z.: Veritas: Generalizable deepfake detection via pattern-aware reasoning (2026), https://arxiv.org/abs/2508.21048 3, 11

  53. [53]

    EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

    Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. ArXiv abs/1905.11946 (2019), https://api.semanticscholar.org/CorpusID:1672172614

  54. [54]

    Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering. ACM Transactions on Graphics (TOG) 38, 1–12 (2019), https://api.semanticscholar.org/CorpusID:2199506251

  55. [55]

    Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: Real-time face capture and reenactment of rgb videos. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2387–2395 (2016), https://api.semanticscholar.org/CorpusID:528585691

  56. [56]

    Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Neural Information Processing Systems (2017), https://api.semanticscholar.org/CorpusID:13756489 4

  57. [57]

    Veo: Veo 3 demo | sailor and the sea (2025), https://www.youtube.com/watch?v=mCFMn0UkRt0 1

  58. [58]

    Wadhwa, N., Garg, R., Jacobs, D.E., Feldman, B.E., Kanazawa, N., Carroll, R., Movshovitz-Attias, Y., Barron, J.T., Pritch, Y., Levoy, M.: Synthetic depth-of-field with a single-camera mobile phone. ACM Transactions on Graphics 37(4), 1–13 (Jul 2018). https://doi.org/10.1145/3197517.3201329, http://dx.doi.org/10.1145/3197517.3201329 11

  59. [59]

    Wakefield, J.: Deepfake presidents used in russia-ukraine war (Mar 2022), https://www.bbc.com/news/technology-607801421

  60. [60]

    Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: Cnn-generated images are surprisingly easy to spot... for now. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 8692–8701 (2019), https://api.semanticscholar.org/CorpusID:20944479810

  61. [61]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models (2023), https://arxiv.org/abs/2201.11903 2, 3

  62. [62]

    Wikipedia: (Oct 2025), https://en.wikipedia.org/wiki/Will_Smith_Eating_Spaghetti_test 13, 14

  63. [63]

    Yan, Z., Zhang, Y., Yuan, X., Lyu, S., Wu, B.: Deepfakebench: A comprehensive benchmark of deepfake detection. ArXiv abs/2307.01426 (2023), https://api.semanticscholar.org/CorpusID:2593421578, 9, 4, 12

  64. [64]

    Yan, Z., Luo, Y., Lyu, S., Liu, Q., Wu, B.: Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 8984–8994 (2023), https://api.semanticscholar.org/CorpusID:2652946231, 3, 4

  65. [65]

    Yan, Z., Wang, J., Jin, P., Zhang, K.Y., Liu, C., Chen, S., Yao, T., Ding, S., Wu, B., Yuan, L.: Orthogonal subspace decomposition for generalizable ai-generated image detection (2025), https://arxiv.org/abs/2411.15633 3, 8, 9, 4

  66. [66]

    Yan, Z., Yao, T., Chen, S., Zhao, Y., Fu, X., Zhu, J., Luo, D., Yuan, L., Wang, C., Ding, S., et al.: Df40: Toward next-generation deepfake detection. arXiv preprint arXiv:2406.13495 (2024) 1, 8, 9, 10, 12, 3

  67. [67]

    Yan, Z., Zhang, Y., Fan, Y., Wu, B.: Ucf: Uncovering common features for generalizable deepfake detection. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 22355–22366 (2023), https://api.semanticscholar.org/CorpusID:2583524311, 3, 4

  68. [68]

    Yan, Z., Zhao, Y., Chen, S., Guo, M., Fu, X., Yao, T., Ding, S., Yuan, L.: Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning (2024), https://arxiv.org/abs/2408.17065 3

  69. [69]

    Yu, P., Fei, J., Gao, H., Feng, X., Xia, Z., Chang, C.H.: Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection (2025), https://arxiv.org/abs/2503.14853 2, 3, 5, 8, 9, 10, 11, 4, 6

  70. [70]

    Zhang, Y., Colman, B., Guo, X., Shahriyari, A., Bharaj, G.: Common sense reasoning for deepfake detection (2024), https://arxiv.org/abs/2402.00126 2, 3, 8, 9, 10, 11, 4, 5, 7

  71. [71]

    Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., Xia, W.: Learning self-consistency for deepfake detection. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 15003–15013 (2020), https://api.semanticscholar.org/CorpusID:2364564483

  72. [72]

    Zheng, Y., Bao, J., Chen, D., Zeng, M., Wen, F.: Exploring temporal coherence for more general video face forgery detection (2021), https://arxiv.org/abs/2108.06693 3

  73. [73]

    Zhou, Z., Luo, Y., Wu, Y., Sun, K., Ji, J., Yan, K., Ding, S., Sun, X., Wu, Y., Ji, R.: Aigi-holmes: Towards explainable and generalizable ai-generated image detection via multimodal large language models (2025), https://arxiv.org/abs/2507.02664 3

  74. [74]

    Zhuang, W., Chu, Q., Tan, Z., Liu, Q., Yuan, H., Miao, C., Luo, Z., Yu, N.: Uia-vit: Unsupervised inconsistency-aware method based on vision transformer for face forgery detection. ArXiv abs/2210.12752 (2022), https://api.semanticscholar.org/CorpusID:2530981893, 8, 9, 4

  75. [75]

    Zou, Z., Gong, B., Wang, L.: Attention to neural plagiarism: Diffusion models can plagiarize your copyrighted images! In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 19546–19556 (October 2025) 13, 14