Generalizable Face Forgery Detection via Separable Prompt Learning
Pith reviewed 2026-05-10 06:01 UTC · model grok-4.3
The pith
Separable prompt learning on CLIP's text modality disentangles forgery cues to improve generalizable face forgery detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Separable Prompt Learning (SePL) strategy disentangles forgery-specific and forgery-irrelevant information in images via two types of prompt learning. A cross-modality alignment strategy and a set of dedicated objectives enforce this separation, so that the text modality can instruct forgery detection. With this adaptation, the method achieves competitive or superior performance relative to prior methods under both cross-dataset and cross-method evaluation.
What carries the argument
Separable Prompt Learning (SePL) using two prompt types to separate forgery-specific from forgery-irrelevant information, plus cross-modality alignment objectives, to adapt CLIP's text encoder for detection.
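The two-branch idea can be caricatured in a few lines. This is a minimal sketch assuming a CLIP-like joint embedding space: the prompt vectors below are random stand-ins for SePL's learned prompt tokens, and the scoring rule is our illustration, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dim = 512  # CLIP ViT-B embedding width, for illustration

# Stand-ins for the two text-prompt embeddings: one branch meant to capture
# forgery-specific cues, one meant to absorb forgery-irrelevant content
# (identity, pose, lighting). In SePL these would come from CLIP's frozen
# text encoder applied to learnable prompt tokens.
prompt_forgery = l2_normalize(rng.normal(size=dim))
prompt_irrelevant = l2_normalize(rng.normal(size=dim))

def forgery_probability(image_emb, temperature=0.07):
    """Softmax over cosine similarities to the two prompt embeddings."""
    z = l2_normalize(image_emb)
    logits = np.array([z @ prompt_forgery, z @ prompt_irrelevant]) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[0]  # mass assigned to the forgery-specific branch

# Toy embeddings: a fake image leans toward the forgery prompt, a real one
# toward the irrelevant prompt, plus a small normalized noise component.
noise = l2_normalize(rng.normal(size=dim))
fake_emb = l2_normalize(0.9 * prompt_forgery + 0.1 * noise)
real_emb = l2_normalize(0.9 * prompt_irrelevant + 0.1 * noise)
print(forgery_probability(fake_emb) > forgery_probability(real_emb))  # True
```

The point of the sketch is only that, once the text side carries a dedicated "forgery" direction, detection reduces to comparing an image embedding against the two branches.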
If this is right
- The CLIP model serves as an effective face forgery detector after the simple text-focused adaptation.
- Performance remains competitive or superior to existing methods in cross-dataset settings.
- Performance also holds in cross-method settings with different forgery generation techniques.
- The disentanglement of information types drives the observed generalizability.
Where Pith is reading between the lines
- Text-side prompting may offer a lightweight route to adapt other vision-language models for forensic or detection tasks.
- The same separation of relevant versus irrelevant cues could apply to generalization problems in related areas such as anomaly or manipulation detection.
- Combining the separable text prompts with visual-side prompts might produce further gains in accuracy.
Load-bearing premise
With careful design, CLIP's text modality can be leveraged to instruct deepfake detection by disentangling forgery-specific and forgery-irrelevant information through prompt learning and cross-modality alignment.
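To make the premise concrete, here is one hypothetical form the "dedicated objectives" could take: an alignment term pulling fake-image features toward the forgery-specific text feature, and a separation term keeping the two prompt branches from encoding the same direction. Both losses are our assumptions for illustration, not the paper's actual objectives.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def alignment_loss(image_feats, forgery_text, labels, temperature=0.1):
    """Binary cross-entropy on image-to-forgery-prompt similarity:
    fake images (label 1) are pulled toward the forgery-specific text
    feature, real images (label 0) are pushed away from it."""
    sims = l2_normalize(image_feats) @ l2_normalize(forgery_text)
    probs = 1.0 / (1.0 + np.exp(-sims / temperature))
    eps = 1e-9
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))

def separation_loss(forgery_text, irrelevant_text):
    """Squared cosine penalty discouraging the two prompt branches from
    collapsing onto the same direction in the embedding space."""
    return float((l2_normalize(forgery_text) @ l2_normalize(irrelevant_text)) ** 2)

rng = np.random.default_rng(0)
dim = 64
forgery_text = rng.normal(size=dim)
irrelevant_text = rng.normal(size=dim)

# Features consistent with the labels give a lower alignment loss than
# features that ignore the forgery prompt entirely.
labels = np.array([1.0, 1.0, 0.0, 0.0])
aligned = np.stack([forgery_text, forgery_text, -forgery_text, -forgery_text])
random_feats = rng.normal(size=(4, dim))
print(alignment_loss(aligned, forgery_text, labels)
      < alignment_loss(random_feats, forgery_text, labels))  # True
```

If the premise holds, minimizing objectives of roughly this shape is what lets the text branch, rather than the visual encoder, carry the detection signal.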
What would settle it
A cross-dataset or cross-method test in which the SePL method fails to match or exceed the performance of standard CLIP visual adaptations or prior deepfake detectors.
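Such a test is mechanical to run once detector scores exist. A sketch with synthetic scores (the numbers are placeholders, not reported results) showing how cross-dataset AUC degradation would be measured:

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability that a fake sample outranks a real one."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 500)  # 500 real, 500 fake

# Synthetic detector scores: in-domain scores are well separated; under a
# cross-dataset shift the separation (and hence the AUC) degrades.
in_domain = np.concatenate([rng.normal(0.2, 0.1, 500),
                            rng.normal(0.8, 0.1, 500)])
cross_dataset = np.concatenate([rng.normal(0.4, 0.2, 500),
                                rng.normal(0.6, 0.2, 500)])

print(f"in-domain AUC:     {auc(in_domain, labels):.3f}")
print(f"cross-dataset AUC: {auc(cross_dataset, labels):.3f}")
# The claim would be settled against SePL if its cross-dataset AUC fell
# below that of standard CLIP visual adaptations on the same split.
```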
Original abstract
Detecting face forgeries using CLIP has recently emerged as a promising and increasingly popular research direction. Owing to its rich visual knowledge acquired through large-scale pretraining, most existing methods typically rely on the visual encoder of CLIP, while paying limited attention to the text modality. Given the instructive nature of the text modality, we posit that it can be leveraged to instruct Deepfake detection with meticulous design. Accordingly, we shift the focus from the visual modality to the text modality and propose a new Separable Prompt Learning strategy (SePL) that enables CLIP to serve as an effective face forgery detector. The core idea of SePL is to disentangle forgery-specific and forgery-irrelevant information in images via two types of prompt learning, with the former enhancing detection. To achieve this disentanglement, we describe a cross-modality alignment strategy and a set of dedicated objectives. Extensive experiments demonstrate that, with this simple adaptation, our method achieves competitive and even superior performance compared to other methods under both cross-dataset and cross-method evaluation, highlighting its strong generalizability. The codes have been released at https://github.com/OUC-YER/SePL-DeepfakeDetection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Separable Prompt Learning (SePL) as an adaptation of CLIP for face forgery detection. It shifts emphasis to the text modality by introducing two types of prompts to disentangle forgery-specific information from forgery-irrelevant content, supported by a cross-modality alignment strategy and dedicated objectives. The central claim is that this yields competitive or superior performance under cross-dataset and cross-method evaluations, demonstrating strong generalizability, with code released.
Significance. If the disentanglement mechanism is shown to isolate forgery artifacts rather than performing generic adaptation, the work could meaningfully extend multimodal prompt learning to forgery detection tasks and improve robustness across datasets and forgery methods. The public code release supports reproducibility and is a clear strength.
major comments (2)
- Abstract: the assertion of 'competitive and even superior performance' is presented without any quantitative metrics, baseline comparisons, dataset names, or ablation results, leaving the central empirical claim unsupported in the provided text and requiring explicit verification in the experiments section.
- Method section (description of SePL and cross-modality alignment): no evidence is shown that the forgery-specific prompts attend to known low-level forgery signals (e.g., blending boundaries or frequency anomalies) while the irrelevant branch suppresses them; without attention maps, activation visualizations, or controlled ablations isolating the separable design from standard prompt tuning, the performance gains could arise from generic CLIP adaptation rather than the claimed disentanglement.
minor comments (1)
- Abstract: 'meticulous design' and 'dedicated objectives' are used without naming the objectives or alignment loss terms, which should be introduced with equations for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our claims.
Point-by-point responses
-
Referee: Abstract: the assertion of 'competitive and even superior performance' is presented without any quantitative metrics, baseline comparisons, dataset names, or ablation results, leaving the central empirical claim unsupported in the provided text and requiring explicit verification in the experiments section.
Authors: The abstract is written as a concise high-level summary following standard conventions. Detailed quantitative support, including AUC metrics, baseline comparisons, and results on datasets such as FaceForensics++, Celeb-DF, and DFDC under cross-dataset and cross-method protocols, is provided in Section 4 and Tables 1-4. We will revise the abstract to include key performance highlights (e.g., average AUC improvements) to make the central claim more self-contained. revision: yes
-
Referee: Method section (description of SePL and cross-modality alignment): no evidence is shown that the forgery-specific prompts attend to known low-level forgery signals (e.g., blending boundaries or frequency anomalies) while the irrelevant branch suppresses them; without attention maps, activation visualizations, or controlled ablations isolating the separable design from standard prompt tuning, the performance gains could arise from generic CLIP adaptation rather than the claimed disentanglement.
Authors: Section 4.3 presents ablation studies with controlled variants that isolate the separable prompt design and cross-modality objectives from generic prompt tuning, showing clear performance drops when these components are ablated. These results indicate the gains arise from the disentanglement rather than generic adaptation. To further strengthen interpretability, we will add attention map visualizations and activation analyses demonstrating focus on forgery artifacts in the revised version. revision: yes
Circularity Check
No circularity: empirical method with independent experimental validation
full rationale
The paper proposes an empirical adaptation of CLIP via separable prompt learning (SePL) for face forgery detection, using two prompt types, cross-modality alignment, and dedicated objectives to disentangle forgery-specific information. No equations, derivations, or first-principles results are presented that reduce by construction to fitted inputs or self-referential definitions. The central claims rest on experimental results under cross-dataset and cross-method settings rather than on any load-bearing self-citation chain or an ansatz smuggled in via prior work. The approach is presented as an empirical design choice with released code, so its claims can be checked against external benchmarks rather than resting on self-referential derivation.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Separable Prompt Learning (SePL)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
FaceForensics++: Learning to detect manipulated facial images
A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “FaceForensics++: Learning to detect manipulated facial images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1–11
2019
-
[2]
The DeepFake Detection Challenge (DFDC) Dataset
B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, “The DeepFake detection challenge (DFDC) dataset,” arXiv preprint arXiv:2006.07397, 2020
2020
-
[3]
MesoNet: A compact facial video forgery detection network
D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: A compact facial video forgery detection network,” in Proceedings of the IEEE International Workshop on Information Forensics and Security, 2018, pp. 1–7
2018
-
[4]
Multi-attentional deepfake detection
H. Zhao, W. Zhou, D. Chen, T. Wei, W. Zhang, and N. Yu, “Multi-attentional deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2185–2194
2021
-
[5]
End-to-end reconstruction-classification learning for face forgery detection
J. Cao, C. Ma, T. Yao, S. Chen, S. Ding, and X. Yang, “End-to-end reconstruction-classification learning for face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4113–4122
2022
-
[6]
Detecting deepfakes with self-blended images
K. Shiohara and T. Yamasaki, “Detecting deepfakes with self-blended images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18720–18729
2022
-
[7]
Beyond the prior forgery knowledge: Mining critical clues for general face forgery detection
A. Luo, C. Kong, J. Huang, Y. Hu, X. Kang, and A. C. Kot, “Beyond the prior forgery knowledge: Mining critical clues for general face forgery detection,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 1168–1182, 2024
2024
-
[9]
DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms
H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, Y. Liu, and J. Zhao, “DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1318–1327
2020
-
[10]
Thinking in frequency: Face forgery detection by mining frequency-aware clues
Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in frequency: Face forgery detection by mining frequency-aware clues,” in European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 12357. Springer, 2020, pp. 86–103
2020
-
[11]
Generalizing face forgery detection with high-frequency features
Y. Luo, Y. Zhang, J. Yan, and W. Liu, “Generalizing face forgery detection with high-frequency features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16317–16326
2021
-
[12]
Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain
H. Liu, X. Li, W. Zhou, Y. Chen, Y. He, H. Xue, W. Zhang, and N. Yu, “Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 772–781
2021
-
[13]
Constructing new backbone networks via space-frequency interactive convolution for deepfake detection
Z. Guo, Z. Jia, L. Wang, D. Wang, G. Yang, and N. K. Kasabov, “Constructing new backbone networks via space-frequency interactive convolution for deepfake detection,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 401–413, 2024
2024
-
[14]
Exposing DeepFake videos by detecting face warping artifacts
Y. Li and S. Lyu, “Exposing DeepFake videos by detecting face warping artifacts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 46–52
2019
-
[15]
Face X-Ray for more general face forgery detection
L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face X-Ray for more general face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5001–5010
2020
-
[16]
CORE: COnsistent REpresentation learning for face forgery detection
Y. Ni, D. Meng, C. Yu, C. Quan, D. Ren, and Y. Zhao, “CORE: COnsistent REpresentation learning for face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022, pp. 12–21
2022
-
[17]
Uncovering common features for generalizable deepfake detection
Z. Yan, Y. Zhang, Y. Fan, and B. Wu, “Uncovering common features for generalizable deepfake detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22412–22423
2023
-
[18]
Transcending forgery specificity with latent space augmentation for generalizable deepfake detection
Z. Yan, Y. Luo, S. Lyu, Q. Liu, and B. Wu, “Transcending forgery specificity with latent space augmentation for generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8984–8994
2024
-
[19]
Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection
L. Chen, Y. Zhang, Y. Song, L. Liu, and J. Wang, “Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18710–18719
2022
-
[20]
Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection
Y. Lin, W. Song, B. Li, Y. Li, J. Ni, H. Chen, and Q. Li, “Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection,” in European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 15144. Springer, 2024, pp. 104–122
2024
-
[21]
Exploring disentangled content information for face forgery detection
J. Liang, H. Shi, and W. Deng, “Exploring disentangled content information for face forgery detection,” in European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 13674. Springer, 2022, pp. 128–145
2022
-
[22]
Exposing the deception: Uncovering more forgery clues for deepfake detection
Z. Ba, Q. Liu, Z. Liu, S. Wu, F. Lin, L. Lu, and K. Ren, “Exposing the deception: Uncovering more forgery clues for deepfake detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 719–728
2024
-
[23]
Learning transferable visual models from natural language supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, 2021, pp. 8748–8763
2021
-
[24]
Forensics adapter: Adapting CLIP for generalizable face forgery detection
X. Cui, Y. Li, A. Luo, J. Zhou, and J. Dong, “Forensics adapter: Adapting CLIP for generalizable face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[25]
Orthogonal subspace decomposition for generalizable AI-generated image detection
Z. Yan, J. Wang, P. Jin, K.-Y. Zhang, C. Liu, S. Chen, T. Yao, S. Ding, B. Wu, and L. Yuan, “Orthogonal subspace decomposition for generalizable AI-generated image detection,” in Proceedings of the International Conference on Machine Learning, 2025
2025
-
[26]
C2P-CLIP: Injecting category common prompt in CLIP to enhance generalization in deepfake detection
C. Tan, R. Tao, H. Liu, G. Gu, B. Wu, Y. Zhao, and Y. Wei, “C2P-CLIP: Injecting category common prompt in CLIP to enhance generalization in deepfake detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025
2025
-
[27]
DeepfakeCLIP: Semantic-opposite prompt learning for generalizable deepfake detection
X. Chen, T. Huang, W. Liu, Z. Wang, W. Li, W. Huang, R. Chen, and H. Luo, “DeepfakeCLIP: Semantic-opposite prompt learning for generalizable deepfake detection,” Knowledge-Based Systems, vol. 330, p. 114681, 2025
2025
-
[28]
Learning to prompt for vision-language models
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022
2022
-
[29]
In ictu oculi: Exposing AI created fake face videos by detecting eye blinking
Y. Li, M.-C. Chang, and S. Lyu, “In ictu oculi: Exposing AI created fake face videos by detecting eye blinking,” in Proceedings of the IEEE International Workshop on Information Forensics and Security, 2018, pp. 1–7
2018
-
[30]
Exposing deep fakes using inconsistent head poses
X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using inconsistent head poses,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 8261–8265
2019
-
[31]
FakeCatcher: Detection of synthetic portrait videos using biological signals
U. A. Ciftci, I. Demir, and L. Yin, “FakeCatcher: Detection of synthetic portrait videos using biological signals,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 767–782, 2022
2022
-
[32]
LAA-Net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection
D. Nguyen, N. Mejri, I. P. Singh, P. Kuleshova, M. Astrid, A. Kacem, E. Ghorbel, and D. Aouada, “LAA-Net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17395–17405
2024
-
[33]
FreqBlender: Enhancing deepfake detection by blending frequency knowledge
H. Li et al., “FreqBlender: Enhancing deepfake detection by blending frequency knowledge,” in Advances in Neural Information Processing Systems, vol. 37, 2024
2024
-
[34]
Exploring bi-level inconsistency via blended images for generalizable face forgery detection
J. Huang et al., “Exploring bi-level inconsistency via blended images for generalizable face forgery detection,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 7034–7048, 2024
2024
-
[35]
Learning on gradients: Generalized artifacts representation for GAN-generated images detection
C. Tan, Y. Zhao, S. Wei, G. Gu, and Y. Wei, “Learning on gradients: Generalized artifacts representation for GAN-generated images detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12105–12114
2023
-
[36]
Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection
C. Tan, H. Liu, Y. Zhao, S. Wei, G. Gu, P. Liu, and Y. Wei, “Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28130–28139
2024
-
[37]
Attention consistency refined masked frequency forgery representation for generalizing face forgery detection
D. Liu, T. Chen, C. Peng, N. Wang, R. Hu, and X. Gao, “Attention consistency refined masked frequency forgery representation for generalizing face forgery detection,” IEEE Transactions on Information Forensics and Security, vol. 20, pp. 504–515, 2025
2025
-
[38]
Towards universal fake image detectors that generalize across generative models
U. Ojha, Y. Li, and Y. J. Lee, “Towards universal fake image detectors that generalize across generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24480–24489
2023
-
[39]
Standing on the shoulders of giants: Reprogramming visual-language model for general deepfake detection
K. Lin, Y. Lin, W. Li, T. Yao, and B. Li, “Standing on the shoulders of giants: Reprogramming visual-language model for general deepfake detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 5262–5270
2025
-
[40]
CLIPping the deception: Adapting vision-language models for universal deepfake detection
S. A. Khan and D.-T. Dang-Nguyen, “CLIPping the deception: Adapting vision-language models for universal deepfake detection,” in Proceedings of the ACM International Conference on Multimedia Retrieval, 2024, pp. 1006–1015
2024
-
[41]
Visual prompt tuning
M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in European Conference on Computer Vision. Springer, 2022, pp. 709–727
2022
-
[42]
MaPLe: Multi-modal prompt learning
M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “MaPLe: Multi-modal prompt learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19113–19122
2023
-
[43]
Visual-language prompt tuning with knowledge-guided context optimization
H. Yao, R. Zhang, and C. Xu, “Visual-language prompt tuning with knowledge-guided context optimization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6757–6767
2023
-
[44]
Conditional prompt learning for vision-language models
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825
2022
-
[45]
Attention is all you need
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008
2017
-
[46]
LoRA: Low-rank adaptation of large language models
E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proceedings of the International Conference on Learning Representations, 2022
2022
-
[47]
Supervised contrastive learning
P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 18661–18673
2020
-
[48]
Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection
P. Yu, J. Fei, H. Gao, X. Feng, Z. Xia, and C. H. Chang, “Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection,” arXiv preprint arXiv:2503.14853, 2025
2025
-
[49]
Implicit identity driven deepfake face swapping detection
B. Huang, Z. Wang, J. Yang, J. Ai, Q. Zou, Q. Wang, and D. Ye, “Implicit identity driven deepfake face swapping detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4490–4499
2023
-
[50]
Can we leave deepfake data behind in training deepfake detector?
J. Cheng, Z. Yan, Y. Zhang, Y. Luo, Z. Wang, and C. Li, “Can we leave deepfake data behind in training deepfake detector?” in Advances in Neural Information Processing Systems, vol. 37, 2024
2024
-
[51]
A hybrid model for generalizable deepfake detection via blending, semantic, and general artifacts
M. K. Le-Phan, M. H. Le, M. T. Tran, and T. L. Do, “A hybrid model for generalizable deepfake detection via blending, semantic, and general artifacts,” in Proceedings of the 2nd Workshop on Security-Centric Strategies for Combating Information Disorder, 2025
2025
-
[52]
Exploring unbiased deepfake detection via token-level shuffling and mixing
X. Fu, Z. Yan, T. Yao, S. Chen, and X. Li, “Exploring unbiased deepfake detection via token-level shuffling and mixing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3040–3048
2025
-
[53]
Celeb-DF: A large-scale challenging dataset for DeepFake forensics
Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-DF: A large-scale challenging dataset for DeepFake forensics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216
2020
-
[54]
Contributing data to deepfake detection research
Google AI Blog, “Contributing data to deepfake detection research,” https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html, 2019
2019
-
[55]
The DeepFake Detection Challenge (DFDC) preview dataset
B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, “The deepfake detection challenge (DFDC) preview dataset,” arXiv preprint arXiv:1910.08854, 2019
2019
-
[56]
WildDeepfake: A challenging real-world dataset for deepfake detection
B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, “WildDeepfake: A challenging real-world dataset for deepfake detection,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2382–2390
2020
-
[57]
Designing one unified framework for high-fidelity face reenactment and swapping
C. Xu, J. Zhang, Y. Han, G. Tian, X. Zeng, Y. Tai, Y. Wang, C. Wang, and Y. Liu, “Designing one unified framework for high-fidelity face reenactment and swapping,” in European Conference on Computer Vision. Springer, 2022, pp. 54–71
2022
-
[58]
BlendFace: Re-designing identity encoders for face-swapping
K. Shiohara, X. Yang, and T. Taketomi, “BlendFace: Re-designing identity encoders for face-swapping,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7634–7644
2023
-
[59]
MobileFaceSwap: A lightweight framework for video face swapping
Z. Xu, Z. Hong, C. Ding, Z. Zhu, J. Han, J. Liu, and E. Ding, “MobileFaceSwap: A lightweight framework for video face swapping,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 2973–2981
2022
-
[60]
Fine-grained face swapping via regional GAN inversion
Z. Liu, M. Li, Y. Zhang, C. Wang, Q. Zhang, J. Wang, and Y. Nie, “Fine-grained face swapping via regional GAN inversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8578–8587
2023
-
[61]
FaceDancer: Pose- and occlusion-aware high fidelity face swapping
F. Rosberg, E. E. Aksoy, F. Alonso-Fernandez, and C. Englund, “FaceDancer: Pose- and occlusion-aware high fidelity face swapping,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3443–3452
2023
-
[62]
FSGAN: Subject agnostic face swapping and reenactment
Y. Nirkin, Y. Keller, and T. Hassner, “FSGAN: Subject agnostic face swapping and reenactment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7184–7193
2019
-
[63]
Inswapper: Insightface face swapping model
InsightFace, “Inswapper: Insightface face swapping model,” https://github.com/haofanwang/inswapper, 2023
2023
-
[64]
SimSwap: An efficient framework for high fidelity face swapping
R. Chen, X. Chen, B. Ni, and Y. Ge, “SimSwap: An efficient framework for high fidelity face swapping,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2003–2011
2020
-
[65]
DF40: Toward next-generation deepfake detection
Z. Yan, T. Yao, S. Chen, Y. Zhao, X. Fu, J. Zhu, D. Luo, C. Wang, S. Ding, Y. Wu, and L. Yuan, “DF40: Toward next-generation deepfake detection,” in Advances in Neural Information Processing Systems, vol. 37, 2024
2024
-
[66]
PyTorch: An imperative style, high-performance deep learning library
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, vol. 32, 2019
2019
-
[67]
Grad-CAM: Visual explanations from deep networks via gradient-based localization
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 618–626
2017
-
[68]
DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection
L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, “DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2889–2898
2020
-
[69]
Progressive growing of GANs for improved quality, stability, and variation
T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” in Proceedings of the International Conference on Learning Representations, 2018
2018
discussion (0)