The Regularizing Power of Language-Training Deepfake Detectors

Benedikt Hopf; Radu Timofte; Zongwei Wu

arxiv: 2605.31192 · v1 · pith:DEXFSJCGnew · submitted 2026-05-29 · 💻 cs.CV


Pith reviewed 2026-06-28 23:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords: deepfake detection, multimodal LLM, regularization, cross-dataset generalization, interpretability, reinforcement learning, dual-encoder architecture


The pith

Language training regularizes deepfake detectors by steering them toward high-level generalizable features rather than low-level dataset artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that deepfake detectors typically overfit to low-level domain-specific cues that fail to transfer across datasets. It proposes that an LLM pretrained on language will naturally favor high-level, describable artifacts that generalize better, and shows this can be exploited as a regularization mechanism. A dual-encoder architecture pairs a frozen specialist detector with a LoRA-tuned multimodal LLM encoder. Training proceeds in two stages: first binary alignment to combine features, then reinforcement learning that rewards the model for generating descriptive reasoning before classifying, using only binary labels. This produces both interpretable outputs and measurable gains in cross-dataset accuracy, even when the reasoning step is dropped at test time.

Core claim

The paper establishes that a dual-encoder architecture combining a frozen specialist detector with a LoRA-tuned MLLM encoder, trained first through binary alignment and then through reinforcement learning that incentivizes explain-then-classify behavior, enables the model to prioritize high-level robust features. This yields improved cross-dataset generalization and produces descriptive reasoning chains, with the performance benefit persisting even when those chains are omitted during inference.

What carries the argument

Dual-encoder architecture (frozen specialist detector paired with LoRA-tuned MLLM encoder) plus two-stage curriculum of binary alignment followed by RL for explain-then-classify.
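For readers unfamiliar with LoRA, the low-rank update it applies to a frozen weight matrix can be written as below. This is the standard formulation from the LoRA paper [24], not a detail taken from this paper: the rank r, the scaling factor α, and which MLLM weight matrices are adapted are not stated on this page, so the symbols are generic.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Here W_0 stays frozen and only A and B are trained. B is initialized to zero, so the adapted encoder starts out identical to the pretrained one.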

If this is right

Cross-dataset performance exceeds prior state-of-the-art methods by a large margin.
The model produces human-readable reasoning chains before classifying.
Accuracy gains remain even when reasoning chains are removed at inference time.
The approach combines high-level language features where possible with low-level features only when necessary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same curriculum could be tested on other binary classification tasks that suffer from domain-specific overfitting, such as medical image anomaly detection.
If the preference for describable features is the active ingredient, simpler prompting strategies without RL might achieve partial regularization gains.
The method suggests that interpretability and generalization can be pursued jointly rather than traded off.

Load-bearing premise

An LLM pretrained on language will intrinsically prefer high-level describable artifacts over low-level domain-specific ones, allowing the RL stage to successfully steer the model toward robust features.

What would settle it

Train the RL stage and evaluate on several held-out datasets. The claim would be undermined if cross-dataset accuracy does not improve over the binary-alignment baseline, or if the generated descriptions stay generic and do not predict the classification decision.
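One way to make the first half of this test concrete is a paired bootstrap over per-sample correctness on a held-out dataset. The sketch below is an illustration under stated assumptions, not the paper's evaluation protocol: the function name, the 0/1 correctness inputs, and the 95% percentile interval are all choices made here for the example.

```python
import random

def paired_bootstrap_gain(correct_stage1, correct_stage2, n_resamples=2000, seed=0):
    """Estimate the accuracy gain of stage 2 over stage 1 on the same held-out samples.

    Both inputs are lists of 0/1 correctness flags, aligned by sample (hypothetical format).
    Returns (observed gain, lower bound, upper bound) of a 95% percentile interval.
    """
    assert len(correct_stage1) == len(correct_stage2)
    rng = random.Random(seed)
    n = len(correct_stage1)
    # Per-sample difference: +1 if only stage 2 is right, -1 if only stage 1 is right.
    diffs = [b - a for a, b in zip(correct_stage1, correct_stage2)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper

# Toy data: eight held-out samples scored by both training stages.
stage1 = [0, 1, 0, 1, 0, 0, 1, 0]
stage2 = [1, 1, 1, 1, 0, 1, 1, 1]
gain, lower, upper = paired_bootstrap_gain(stage1, stage2, n_resamples=500, seed=1)
```

If the interval for the gain comfortably includes zero on most held-out datasets, the RL stage would not be adding measurable regularization beyond the binary-alignment baseline.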

Figures

Figures reproduced from arXiv: 2605.31192 by Benedikt Hopf, Radu Timofte, Zongwei Wu.

**Figure 1.** Motivation. Previous methods usually (a) do not provide language output or (b) learn post-hoc explanations from supervised data (human-annotated or handcrafted features). Using supervised finetuning, reinforcement learning, and a dual-encoder design, our method jointly learns language and detection, leading not only to interpretable descriptions but also benefiting from the implicit regularization of l…

**Figure 2.** First stage: modality alignment. Tokens from deepfake detector, vision encoder, and text are passed to the model, asking for a one-word answer. The model outputs a probability distribution over all words, which we can supervise with binary labels and calculate binary metrics from. Note that, unlike previous work, we do not need a separate classification head and directly use the MLLM for classification,…
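The one-word classification described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the binary score is obtained by renormalizing the next-token logits over two answer tokens; the caption does not say how the paper maps the vocabulary distribution to a binary probability, and the token ids and function names here are hypothetical.

```python
import math

def fake_probability(logits, real_id, fake_id):
    """P(fake) from next-token logits, renormalized over the two answer tokens only."""
    # A two-way softmax reduces to a sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(logits[real_id] - logits[fake_id]))

def binary_cross_entropy(p_fake, label, eps=1e-12):
    """Binary label: 1 for fake, 0 for real."""
    p = min(max(p_fake, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Toy vocabulary of five tokens; ids 2 and 3 stand in for "real" and "fake".
logits = [0.1, -1.3, 2.0, 0.5, 0.0]
p_fake = fake_probability(logits, real_id=2, fake_id=3)
loss = binary_cross_entropy(p_fake, label=1)
```

Because the score comes straight from the language-model head, the same `p_fake` value can feed both the training loss and threshold-free binary metrics, without a separate classification head.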

**Figure 3.** Second stage: Reinforcement learning. We provide the model with the requested output structure, a question regarding the authenticity of the candidate image, and the image itself. We then sample multiple answers, judge them using our reward functions, and train using GRPO [49]. Negative advantages are discouraged, positive ones encouraged. This strengthens the alignment between the modalities, as all compo…
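The "sample multiple answers, judge them, compare" step in the Figure 3 caption corresponds to GRPO's group-relative advantage, which normalizes each sampled answer's reward against the other answers for the same prompt. The sketch below shows only that normalization, assuming a population standard deviation and a small epsilon for stability; the policy-gradient update, clipping, and KL term of GRPO [49] are omitted, and the reward values are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of answers sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers for one image: three rewarded, one not.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that score above the group mean receive a positive advantage and are made more likely; those below receive a negative one. If every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.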

**Figure 4.** Out-of-domain examples. The first image is the famous Will-Smith-eating-spaghetti example [62] by Google’s Veo 3 [10], the second one is taken from [75]. The center left is generated by Google Gemini 2.5 Flash [17], the center right and lower left by Gemini 3 Flash [46], and the final image is a still from [40]. We specifically include a failure case, showing that very high-quality images can avoid detecti…

Original abstract

Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes RL on a LoRA-tuned MLLM to regularize deepfake detection toward high-level features, but the binary reward gives no guarantee the explanations will be causal or that low-level overfitting is reduced.


This paper's main pitch is that language-based RL can regularize a deepfake detector by making it favor describable high-level features, leading to better generalization and built-in interpretability. The dual-encoder with a frozen specialist and LoRA-tuned MLLM, trained first for binary alignment then with RL for explain-then-classify, is the concrete proposal.

The new element is applying RL with only binary supervision to encourage reasoning chains in this domain. It does a reasonable job framing the joint solution to generalization and interpretability.

The soft spot is that the reward signal does not penalize non-causal explanations, so the model could keep relying on low-level cues while outputting fluent text. The abstract claims large gains on benchmarks, but without the numbers or ablations visible here, it's difficult to tell if the RL stage actually delivers the regularization effect or if the first stage already does most of the work.

The assumption that the MLLM pretraining will naturally push toward robust features is plausible but not obviously enforced by the setup.

This work is aimed at the deepfake detection community, especially those exploring multimodal models. A reader working on similar regularization ideas could get value from the architecture even if the results need verification.

It deserves serious refereeing to examine the experiments and test whether the claimed mechanism holds.

Referee Report

2 major / 1 minor

Summary. The paper claims that deepfake detection overfitting to low-level domain-specific artifacts can be mitigated by leveraging MLLM language pretraining as a regularizer. It introduces a dual-encoder architecture (frozen specialist detector + LoRA-tuned MLLM) trained in two stages: (1) binary alignment to combine features and reduce dataset-specific overfitting, and (2) RL that rewards 'explain-then-classify' behavior using only binary labels. This is asserted to yield both interpretable descriptions and improved cross-dataset generalization, even when reasoning is omitted at inference, with extensive experiments showing large-margin outperformance over SOTA.

Significance. If the results hold, the work would demonstrate a practical mechanism for using language pretraining bias to favor generalizable high-level features in detection tasks, simultaneously advancing interpretability and robustness without extra supervision. The two-stage curriculum and dual-encoder design are concrete contributions that could influence future multimodal regularization approaches in CV.

major comments (2)

[Abstract] The central claim that the RL stage 'explicitly incentivize[s] the model to prioritize high-level, robust features' rests on rewarding only binary classification correctness plus format compliance. No term in the reward penalizes post-hoc or non-causal explanations, so the mechanism does not demonstrably force the model away from low-level cues that still produce correct binary labels.
[Abstract, method description] The assertion that 'an LLM that has been pretrained on language would prefer high-level artifacts that can be described better' is presented as an intrinsic bias that the RL curriculum exploits, yet no ablation or analysis is referenced showing that the generated explanations are faithful to the detector's actual decision features rather than fluent but decoupled text.

minor comments (1)

[Title] The title contains an apparent hyphenation inconsistency ('Language-Training').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, proposing targeted revisions to the abstract and discussion sections where the concerns are valid.


Referee: [Abstract] The central claim that the RL stage 'explicitly incentivize[s] the model to prioritize high-level, robust features' rests on rewarding only binary classification correctness plus format compliance. No term in the reward penalizes post-hoc or non-causal explanations, so the mechanism does not demonstrably force the model away from low-level cues that still produce correct binary labels.

Authors: We agree that the reward signal consists solely of binary correctness and format compliance and therefore does not contain an explicit penalty for post-hoc or non-causal reasoning. The claim that the RL stage 'explicitly incentivizes' prioritization of high-level features is therefore stronger than the evidence directly supports. We will revise the abstract to replace 'explicitly incentivize' with 'encourage via the explain-then-classify format' and will add a short paragraph in the discussion acknowledging that the regularization effect is indirect and could in principle be satisfied by low-level cues accompanied by fluent but non-causal text.

Revision: yes

Referee: [Abstract, method description] The assertion that 'an LLM that has been pretrained on language would prefer high-level artifacts that can be described better' is presented as an intrinsic bias that the RL curriculum exploits, yet no ablation or analysis is referenced showing that the generated explanations are faithful to the detector's actual decision features rather than fluent but decoupled text.

Authors: The referee is correct that the manuscript offers no ablation or analysis (e.g., comparison with saliency maps or intervention experiments) demonstrating that the generated explanations are faithful to the features actually used by the dual-encoder rather than fluent but decoupled text. We will add this point explicitly to the limitations subsection and will outline possible verification approaches for future work, while retaining the original hypothesis as a motivating intuition rather than a proven mechanism.

Revision: yes
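The reward discussed in this exchange, binary correctness plus format compliance, can be sketched as below to show why the referee's concern holds. This is an illustration only: the `<think>`/`<answer>` tags, the weights, and the function name are hypothetical, since the page does not give the paper's output template or reward values.

```python
import re

# Hypothetical output template; the tag names are illustrative, not taken from the paper.
TEMPLATE = re.compile(r"^<think>(.+?)</think>\s*<answer>(real|fake)</answer>$", re.DOTALL)

def format_and_label_reward(completion, label, format_weight=0.5):
    """Reward = format compliance + binary correctness. The reasoning text is never scored."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    correct = 1.0 if match.group(2) == label else 0.0
    return format_weight + correct

grounded = "<think>blurry jawline and mismatched skin tone</think><answer>fake</answer>"
filler = "<think>lorem ipsum</think><answer>fake</answer>"
same_reward = format_and_label_reward(grounded, "fake") == format_and_label_reward(filler, "fake")
```

Both completions earn the same reward because nothing in the function inspects what the reasoning says. Under a reward of this shape, any preference for describable features has to come from the pretrained language prior rather than from the reward itself.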

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretraining and binary supervision


The paper presents a two-stage curriculum (binary alignment then RL for explain-then-classify) whose central mechanism is the use of an externally pretrained MLLM plus binary labels to incentivize high-level features. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The intuition that language pretraining biases toward describable artifacts is stated as motivation rather than derived from prior self-work, and the RL reward is explicitly binary, leaving the generalization claim as an empirical hypothesis rather than a definitional reduction. The method is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view provides no explicit free parameters, axioms, or invented entities; the method builds on standard components (LoRA, RL with binary labels) whose details are not supplied.

pith-pipeline@v0.9.1-grok · 5789 in / 1089 out tokens · 29391 ms · 2026-06-28T23:18:13.149848+00:00 · methodology


Reference graph

Works this paper leans on

75 extracted references · 32 canonical work pages · 11 internal anchors

[1] Afchar, D., Nozick, V., Yamagishi, J., Echizen, I.: Mesonet: a compact facial video forgery detection network. 2018 IEEE International Workshop on Information Forensics and Security (WIFS) pp. 1–7 (2018), https://api.semanticscholar.org/CorpusID:521574751

[2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025), https://arxiv.org/abs/2502.13923

[3] Bi, X., Liu, B., Yang, F., Xiao, B., Li, W., Huang, G., Cosman, P.C.: Detecting generated images by real images only. ArXiv abs/2311.00962 (2023), https://api.semanticscholar.org/CorpusID:2649353248

[4] Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4113–4122 (June 2022)

[5] Chang, Y.M., Yeh, C., Chiu, W.C., Yu, N.: Antifakeprompt: Prompt-tuned vision-language models are fake image detectors. ArXiv abs/2310.17419 (2023), https://api.semanticscholar.org/CorpusID:2644904908
[6] Chen, H., Magramo, K.: Finance worker pays out $25 million after video call with deepfake “chief financial officer” (Feb 2024), https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-scam-hong-kong-intl-hnk1

[7] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1800–1807 (2016), https://api.semanticscholar.org/CorpusID:23751108

[8] Cui, X., Li, Y., Zhu, D., Zhou, J., Dong, J., Lyu, S.: Forensics adapter: Unleashing clip for generalizable face forgery detection (2025), https://arxiv.org/abs/2411.19715

[9] Deepfakes: deepfakes_faceswap. https://github.com/deepfakes/faceswap (2019)

[10] Deepmind, G.: Veo: a text-to-video generation system (2025)
[11] DeepSeek-AI: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning (2025), https://arxiv.org/abs/2501.12948

[12] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: Qlora: Efficient finetuning of quantized llms (2023), https://arxiv.org/abs/2305.14314

[13] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Canton-Ferrer, C.: The deepfake detection challenge dataset. ArXiv abs/2006.07397 (2020), https://api.semanticscholar.org/CorpusID:2196876161

[14] Dolhansky, B., Howes, R., Pflaum, B., Baram, N., Canton-Ferrer, C.: The deepfake detection challenge (dfdc) preview dataset. ArXiv abs/1910.08854 (2019), https://api.semanticscholar.org/CorpusID:2048009391

[15] Efron, B.: Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7(1), 1–26 (1979). https://doi.org/10.1214/aos/1176344552
[16] Favreau, J., Lucas, G.: The mandalorian (2020), https://www.disneyplus.com/en-de/browse/entity-422f6dcc-226f-44e7-98d4-22de69b31cf3?distributionPartner=google1

[17] Fortin, A., Vernade, G., Kampf, K., Reshi, A.: Introducing gemini 2.5 flash image, our state-of-the-art image model (2025), https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/

[18] Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization (2023), https://arxiv.org/abs/2303.17111

[19] Guo, X., Song, X., Zhang, Y., Liu, X., Liu, X.: Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector (2025), https://arxiv.org/abs/2503.20188

[20] Guo, Y., Zhen, C., Yan, P.: Controllable guide-space for generalizable face forgery detection. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 20761–20770 (2023), https://api.semanticscholar.org/CorpusID:2601648913
[21] Hasanaath, A.A., Luqman, H., Katib, R., Anwar, S.: Fsbi: Deepfakes detection with frequency enhanced self-blended images. ArXiv abs/2406.08625 (2024), https://api.semanticscholar.org/CorpusID:2704405863

[22] He, Y., Yu, N., Keuper, M., Fritz, M.: Beyond the spectrum: Detecting deepfakes via re-synthesis. ArXiv abs/2105.14376 (2021), https://api.semanticscholar.org/CorpusID:2352547663

[23] Hopf, B., Timofte, R.: Practical manipulation model for robust deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. pp. 5675–5684 (October 2025)

[24] Hu, J.E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Chen, W.: Lora: Low-rank adaptation of large language models. ArXiv abs/2106.09685 (2021), https://api.semanticscholar.org/CorpusID:2354580096

[25] Huang, T.M., Lin, W.T., Hua, K.L., Cheng, W.H., Yamagishi, J., Chen, J.C.: Thinkfake: Reasoning in multimodal large language models for ai-generated image detection (2025), https://arxiv.org/abs/2509.19841
[26] Huang, Z., Hu, J., Li, X., He, Y., Zhao, X., Peng, B., Wu, B., Huang, X., Cheng, G.: Sida: Social media image deepfake detection, localization and explanation with large multimodal model. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 28831–28841 (2025), https://api.semanticscholar.org/CorpusID:2745151453

[27] Huang, Z., Li, T., Li, X., Wen, H., He, Y., Zhang, J., Fei, H., Yang, X., Huang, X., Peng, B., Cheng, G.: So-fake: Benchmarking and explaining social media image forgery detection (2025), https://arxiv.org/abs/2505.18660

[28] Jiang, C., Dong, W., Zhang, Z., Yu, F., Peng, W., Yuan, X., Bi, Y., Zhao, M., Zhou, Z., Si, C., Shan, C.: Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection (2026), https://arxiv.org/abs/2506.00979

[29] Jung, T., Kim, S., Kim, K.: Deepvision: Deepfakes detection using human eye blinking pattern. IEEE Access 8, 83144–83154 (2020), https://api.semanticscholar.org/CorpusID:2186518781

[30] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 8107–8116 (2019), https://api.semanticscholar.org/CorpusID:2092022731
[31] Kim, T., Choi, J., Jeong, Y., Noh, H., Yoo, J., Baek, S., Choi, J.: Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection (2025), https://arxiv.org/abs/2507.02398

[32] Kowalski, M.: Faceswap. https://github.com/MarekKowalski/FaceSwap (2018)

[33] Larue, N., Vu, N.S., Struc, V., Peer, P., Christophides, V.: Seeable: Soft discrepancies and bounded contrastive learning for exposing deepfakes. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 20954–20964 (2022), https://api.semanticscholar.org/CorpusID:2537344173

[34] Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B.: Face x-ray for more general face forgery detection. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 5000–5009 (2019), https://api.semanticscholar.org/CorpusID:2095164241

[35] Li, T., Huang, Z., Wen, H., He, Y., Lyu, S., Wu, B., Cheng, G.: Raidx: A retrieval-augmented generation and grpo reinforcement learning framework for explainable deepfake detection. Proceedings of the 33rd ACM International Conference on Multimedia (2025), https://api.semanticscholar.org/CorpusID:2805363793
[36] Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-df: A large-scale challenging dataset for deepfake forensics. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 3204–3213 (2019), https://api.semanticscholar.org/CorpusID:2127264301

[37] Liu, H., Li, X., Zhou, W., Chen, Y., He, Y., Xue, H., Zhang, W., Yu, N.: Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 772–781 (2021), https://api.semanticscholar.org/CorpusID:2320921673

[38] Luo, Y., Zhang, Y., Yan, J., Liu, W.: Generalizing face forgery detection with high-frequency features. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 16312–16321 (2021), https://api.semanticscholar.org/CorpusID:2323205993

[39] Masi, I., Killekar, A., Mascarenhas, R.M., Gurudatt, S.P., AbdAlmageed, W.: Two-branch recurrent network for isolating deepfakes in videos. ArXiv abs/2008.03412 (2020), https://api.semanticscholar.org/CorpusID:2210906633

[40] Nep, D.: This is not morgan freeman - a deepfake singularity (2021)
[41] Nguyen, D., Astrid, M., Kacem, A., Ghorbel, E., Aouada, D.: Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection (2025), https://arxiv.org/abs/2501.01184

[42] Nguyen, D., Mejri, N., Singh, I.P., Kuleshova, P., Astrid, M., Kacem, A., Ghorbel, E., Aouada, D.: Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17395–17405 (2024)

[43] OpenAI: Gpt-5.1: A smarter, more conversational chatgpt (Nov 2025), https://openai.com/index/gpt-5-1/

[44] Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: Face forgery detection by mining frequency-aware clues. ArXiv abs/2007.09355 (2020), https://api.semanticscholar.org/CorpusID:2206474993

[45] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021), https://arxiv.org/abs/2103.00020
[46] Raisinghani, N.: Nano banana 2: Combining pro capabilities with lightning-fast speed (Feb 2026), https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/

[47] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

[48] Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: FaceForensics++: Learning to detect manipulated facial images. In: International Conference on Computer Vision (ICCV) (2019)

[49] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models (2024), https://arxiv.org/abs/2402.03300

[50] Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 18699–18708 (2022), https://api.semanticscholar.org/CorpusID:2482279161
[51] Sun, K., Chen, S., Yao, T., Sun, X., Ding, S., Ji, R.: Towards general visual-linguistic face forgery detection. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 19576–19586 (2023), https://api.semanticscholar.org/CorpusID:2603342192

[52] Tan, H., Lan, J., Tan, Z., Liu, A., Song, C., Shi, S., Zhu, H., Wang, W., Wan, J., Lei, Z.: Veritas: Generalizable deepfake detection via pattern-aware reasoning (2026), https://arxiv.org/abs/2508.21048

[53] Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. ArXiv abs/1905.11946 (2019), https://api.semanticscholar.org/CorpusID:1672172614

[54] Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering. ACM Transactions on Graphics (TOG) 38, 1–12 (2019), https://api.semanticscholar.org/CorpusID:2199506251

[55] Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: Real-time face capture and reenactment of rgb videos. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2387–2395 (2016), https://api.semanticscholar.org/CorpusID:528585691
[56] Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Neural Information Processing Systems (2017), https://api.semanticscholar.org/CorpusID:13756489

[57] Veo: Veo 3 demo | sailor and the sea (2025), https://www.youtube.com/watch?v=mCFMn0UkRt0

[58] Wadhwa, N., Garg, R., Jacobs, D.E., Feldman, B.E., Kanazawa, N., Carroll, R., Movshovitz-Attias, Y., Barron, J.T., Pritch, Y., Levoy, M.: Synthetic depth-of-field with a single-camera mobile phone. ACM Transactions on Graphics 37(4), 1–13 (Jul 2018). https://doi.org/10.1145/3197517.3201329

[59] Wakefield, J.: Deepfake presidents used in russia-ukraine war (Mar 2022), https://www.bbc.com/news/technology-607801421

[60] Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: Cnn-generated images are surprisingly easy to spot... for now. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 8692–8701 (2019), https://api.semanticscholar.org/CorpusID:20944479810
[61] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models (2023), https://arxiv.org/abs/2201.11903
[62] Wikipedia: (Oct 2025), https://en.wikipedia.org/wiki/Will_Smith_Eating_Spaghetti_test
[63] Yan, Z., Zhang, Y., Yuan, X., Lyu, S., Wu, B.: Deepfakebench: A comprehensive benchmark of deepfake detection. ArXiv abs/2307.01426 (2023), https://api.semanticscholar.org/CorpusID:259342157
[64] Yan, Z., Luo, Y., Lyu, S., Liu, Q., Wu, B.: Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 8984–8994 (2023), https://api.semanticscholar.org/CorpusID:265294623
[65] Yan, Z., Wang, J., Jin, P., Zhang, K.Y., Liu, C., Chen, S., Yao, T., Ding, S., Wu, B., Yuan, L.: Orthogonal subspace decomposition for generalizable ai-generated image detection (2025), https://arxiv.org/abs/2411.15633
[66] Yan, Z., Yao, T., Chen, S., Zhao, Y., Fu, X., Zhu, J., Luo, D., Yuan, L., Wang, C., Ding, S., et al.: Df40: Toward next-generation deepfake detection. arXiv preprint arXiv:2406.13495 (2024)
[67] Yan, Z., Zhang, Y., Fan, Y., Wu, B.: Ucf: Uncovering common features for generalizable deepfake detection. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 22355–22366 (2023), https://api.semanticscholar.org/CorpusID:258352431
[68] Yan, Z., Zhao, Y., Chen, S., Guo, M., Fu, X., Yao, T., Ding, S., Yuan, L.: Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning (2024), https://arxiv.org/abs/2408.17065
[69] Yu, P., Fei, J., Gao, H., Feng, X., Xia, Z., Chang, C.H.: Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection (2025), https://arxiv.org/abs/2503.14853
[70] Zhang, Y., Colman, B., Guo, X., Shahriyari, A., Bharaj, G.: Common sense reasoning for deepfake detection (2024), https://arxiv.org/abs/2402.00126
[71] Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., Xia, W.: Learning self-consistency for deepfake detection. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 15003–15013 (2020), https://api.semanticscholar.org/CorpusID:236456448
[72] Zheng, Y., Bao, J., Chen, D., Zeng, M., Wen, F.: Exploring temporal coherence for more general video face forgery detection (2021), https://arxiv.org/abs/2108.06693
[73] Zhou, Z., Luo, Y., Wu, Y., Sun, K., Ji, J., Yan, K., Ding, S., Sun, X., Wu, Y., Ji, R.: Aigi-holmes: Towards explainable and generalizable ai-generated image detection via multimodal large language models (2025), https://arxiv.org/abs/2507.02664
[74] Zhuang, W., Chu, Q., Tan, Z., Liu, Q., Yuan, H., Miao, C., Luo, Z., Yu, N.: Uia-vit: Unsupervised inconsistency-aware method based on vision transformer for face forgery detection. ArXiv abs/2210.12752 (2022), https://api.semanticscholar.org/CorpusID:253098189
[75] Zou, Z., Gong, B., Wang, L.: Attention to neural plagiarism: Diffusion models can plagiarize your copyrighted images! In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 19546–19556 (October 2025)
