PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution

Chunjie Ma; Jiahe Zhang; Tian Gan; Weili Guan; Yaning Zhang; Zan Gao

arxiv: 2504.14129 · v5 · submitted 2025-04-19 · 💻 cs.CV

PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution

Yaning Zhang , Jiahe Zhang , Chunjie Ma , Weili Guan , Tian Gan , Zan Gao This is my paper

Pith reviewed 2026-05-22 18:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords deepfake attributionzero-shot learningvision language modelface parsingcontrastive learningGANdiffusion models

0 comments

The pith

A parsing-aware vision language model attributes deepfakes to unseen generators by tracking differences in facial attribute preservation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PVLM to solve zero-shot deepfake attribution by building on the fact that GAN and diffusion generators preserve source facial attributes differently. It adds a parsing encoder that extracts global face attribute embeddings and performs dynamic vision-parsing matching to learn forgery features. A new contrastive center loss is defined to draw embeddings from the same generator type together and repel those from different types. The resulting model is tested on a dedicated ZS-DFA benchmark that measures fine-grained attribution to advanced unseen generators such as diffusion models. Experiments across multiple protocols show higher accuracy than previous deepfake attribution methods.

Core claim

The central claim is that differences in how GAN versus diffusion generators retain facial attributes can be turned into reliable zero-shot attribution signals by feeding a vision-language model with a dedicated parsing encoder, dynamic vision-parsing matching, and a deepfake attribution contrastive center loss that pulls same-generator samples closer while pushing different-generator samples apart.

What carries the argument

Parsing encoder producing global face attribute embeddings for dynamic vision-parsing matching, plus a contrastive center loss that organizes generator embeddings in feature space.

If this is right

The method supplies a fine-grained protocol for measuring attribution performance on diffusion-based generators never seen in training.
Attribute preservation differences become a usable cue for learning forgery representations without generator-specific labels.
The contrastive center loss can be dropped into other deepfake attribution models to improve separation of generator classes.
State-of-the-art results hold across multiple evaluation protocols on the new ZS-DFA benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If preservation patterns remain stable for future generator families, the same parsing cue could reduce the need to retrain attribution systems each time a new model appears.
Adding text or audio parsing alongside face parsing might tighten attribution when visual cues alone are ambiguous.
Real-time parsing pipelines could turn this approach into an online filter for social-media uploads.

Load-bearing premise

Facial attribute preservation differences between GAN and diffusion generators stay consistent enough across images to support reliable distinction of entirely new generators.

What would settle it

Collect images from a previously unseen advanced generator, run the trained PVLM model, and observe whether attribution accuracy falls to chance level when only the parsing-derived features are used.

Figures

Figures reproduced from arXiv: 2504.14129 by Chunjie Ma, Jiahe Zhang, Tian Gan, Weili Guan, Yaning Zhang, Zan Gao.

**Figure 3.** Figure 3: Cross-generator correlation matrix visualization. We randomly select [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗

**Figure 4.** Figure 4: The visualization of priors from different domains including face parsing, edge, and frequency. Each column shows a face yielded by various generators. [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: The workflow of our PVLM model to conduct ZS-DFA. We first send the appearance image to the Sobel, SRM operator, face parser, and fine-grained [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of parsing images produced by different face parsers for fake images synthesized by various generators. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Robustness to unknown image deformations. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: The t-SNE visualization of various models (w/o or w/ DFACC loss). [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: The heatmap visualization of various models on the sample created by the seen (Left) or unseen (Right) generators. The hotter (red color) a position [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Confusion matrix visualization of various methods. The darker red means more frequent or stronger predictions, and the lighter red denotes less [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

read the original abstract

The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in vision modality, and other modalities such as texts and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen advanced generators like diffusion in a fine-grained manner. In this paper, we propose a novel parsing-aware vision language model with a dynamic contrastive learning (PVLM) method for zero-shot deepfake attribution (ZSDFA), which facilitates effective and fine-grained traceability to unseen advanced generators. Specifically, we conduct a novel and fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors to unseen advanced generators like diffusion. Besides, we propose an innovative PVLM attributor based on the vision-language model to capture general and diverse attribution features. We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly. We propose to employ the inherent facial attributes preservation differences to capture face parsing-aware forgery representations. Therefore, we devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results show that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various protocol evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PVLM brings face parsing and a contrastive center loss into a VLM for zero-shot deepfake attribution to unseen generators, but the benchmark and results lack the controls needed to back the generalization claim.

read the letter

The paper's main move is to add a parsing encoder and dynamic vision-parsing matching to a vision-language backbone, plus a new contrastive center loss, all aimed at attributing deepfakes to unseen diffusion-style generators. They also release a ZS-DFA benchmark for this setting. That combination looks new relative to earlier DFA work that stayed mostly inside vision features, and the motivation from differing facial-attribute preservation between GANs and diffusion models is at least plausible on first read.

Referee Report

3 major / 2 minor

Summary. The paper introduces PVLM, a parsing-aware vision-language model with dynamic contrastive learning for zero-shot deepfake attribution (ZS-DFA). It constructs a new benchmark to evaluate fine-grained attribution performance on unseen advanced generators such as diffusion models. The method is motivated by differences in source face attribute preservation between GAN- and diffusion-generated images; it employs a parsing encoder for global attribute embeddings, dynamic vision-parsing matching, and a contrastive center loss to pull relevant generators closer while pushing irrelevant ones away. Experiments claim that PVLM exceeds prior state-of-the-art methods across multiple protocol evaluations on the proposed ZS-DFA benchmark.

Significance. If the central claims hold after addressing the noted gaps, the work would offer a timely multimodal extension of vision-language models to deepfake attribution, with the new ZS-DFA benchmark filling a gap in evaluating generalization to diffusion-based generators. The parsing-aware and contrastive components provide a concrete mechanism for exploiting attribute-preservation differences, and successful validation could influence subsequent forensic methods that combine vision, language, and structural priors.

major comments (3)

[Benchmark construction section] Benchmark construction section: The manuscript introduces a novel fine-grained ZS-DFA benchmark but supplies no details on data sources, generator selection for seen versus unseen splits, sample counts, resolution or quality controls, or statistical reporting (error bars, significance tests). This information is load-bearing for the headline claim that PVLM exceeds SOTA on the benchmark, as post-hoc choices or distribution shifts could inflate the reported gains.
[Motivation and §4.3] Motivation and §4.3 (contrastive center loss): The core premise that facial-attribute preservation differences between GAN and diffusion generators are sufficiently consistent and category-general to support zero-shot attribution to unseen advanced models is stated but not independently verified (e.g., no cross-generator attribute statistics or ablation on parsing noise). Without such evidence, the dynamic vision-parsing matching and contrastive loss rest on an untested assumption that directly underpins the generalization results.
[§4.3] §4.3, contrastive center loss formulation: The loss is described as pulling relevant generators closer and pushing irrelevant ones away, yet the manuscript provides neither the explicit equation nor training details that would demonstrate the loss is computed without reference to the zero-shot evaluation distributions. This leaves open the possibility that the reported improvements incorporate information unavailable at true zero-shot test time.

minor comments (2)

[Method section] Notation for the dynamic vision-parsing matching module is introduced without a clear diagram or pseudocode, making it difficult to follow how the parsing encoder output is fused with the vision-language features.
[Related work] The related-work discussion cites prior DFA methods but omits recent vision-language models applied to forgery detection; adding these would better situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper without altering its core contributions.

read point-by-point responses

Referee: [Benchmark construction section] Benchmark construction section: The manuscript introduces a novel fine-grained ZS-DFA benchmark but supplies no details on data sources, generator selection for seen versus unseen splits, sample counts, resolution or quality controls, or statistical reporting (error bars, significance tests). This information is load-bearing for the headline claim that PVLM exceeds SOTA on the benchmark, as post-hoc choices or distribution shifts could inflate the reported gains.

Authors: We acknowledge this is a valid concern for reproducibility and interpretability of results. In the revised manuscript we will expand the benchmark construction section with explicit details on data sources (e.g., FFHQ and CelebA-HQ for real faces, specific GAN models for the seen split, and diffusion models such as Stable Diffusion and DDPM variants for the unseen split), the exact protocol for creating seen/unseen partitions to avoid leakage, per-class sample counts, uniform resolution and quality filtering steps, and statistical reporting including standard deviations across multiple random seeds plus significance testing. revision: yes
Referee: [Motivation and §4.3] Motivation and §4.3 (contrastive center loss): The core premise that facial-attribute preservation differences between GAN and diffusion generators are sufficiently consistent and category-general to support zero-shot attribution to unseen advanced models is stated but not independently verified (e.g., no cross-generator attribute statistics or ablation on parsing noise). Without such evidence, the dynamic vision-parsing matching and contrastive loss rest on an untested assumption that directly underpins the generalization results.

Authors: We agree that stronger independent verification of the motivating premise would improve the paper. While the current manuscript contains supporting qualitative observations and component ablations, we will add a dedicated quantitative analysis subsection (or appendix) reporting cross-generator facial-attribute preservation statistics (e.g., mean IoU or embedding similarity for eyes, nose, mouth regions) computed via the parsing encoder on held-out GAN versus diffusion samples. We will also include a controlled ablation that injects varying levels of parsing noise and measures the resulting change in zero-shot attribution accuracy, thereby empirically grounding the assumption. revision: yes
Referee: [§4.3] §4.3, contrastive center loss formulation: The loss is described as pulling relevant generators closer and pushing irrelevant ones away, yet the manuscript provides neither the explicit equation nor training details that would demonstrate the loss is computed without reference to the zero-shot evaluation distributions. This leaves open the possibility that the reported improvements incorporate information unavailable at true zero-shot test time.

Authors: We apologize for the missing explicit formulation. In the revision we will insert the complete mathematical definition of the deepfake attribution contrastive center loss in §4.3, together with the precise training algorithm. The loss operates exclusively on the labeled training set of seen generators; generator centers are computed and updated solely from seen data, and no samples, labels, or statistics from the unseen zero-shot test distributions are ever accessed during training or loss computation. This separation will be stated unambiguously to confirm compliance with zero-shot evaluation protocols. revision: yes

Circularity Check

0 steps flagged

No significant circularity in PVLM derivation or loss formulation

full rationale

The paper introduces a novel PVLM attributor and ZS-DFA benchmark motivated by observed differences in facial attribute preservation between GAN and diffusion generators. It describes a parsing encoder for dynamic vision-parsing matching and a contrastive center loss to pull relevant generators closer, with experimental results claimed to exceed SOTA. No equations are provided in the text that would demonstrate any self-definitional reduction, fitted input renamed as prediction, or load-bearing self-citation chain. The central claims rest on empirical evaluation on the newly constructed benchmark rather than any tautological equivalence to inputs by construction. The load-bearing premise about attribute differences is presented as an observation supporting the method, not derived from the method itself. This is a standard non-circular proposal of a new architecture and evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters, axioms, or invented entities; the central claim rests on unstated assumptions about attribute preservation differences being generalizable.

pith-pipeline@v0.9.0 · 5841 in / 1157 out tokens · 21236 ms · 2026-05-22T18:51:46.673266+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a novel parsing-aware vision language model with a dynamic contrastive learning (PVLM) method... dynamic vision-parsing matching... deepfake attribution contrastive center loss
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 2 internal anchors

[1]

Generative Adversarial Nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative Adversarial Nets,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680

work page 2014
[2]

DiffFace: Diffusion-based face swapping with facial guidance,

K. Kim, Y . Kim, S. Cho, J. Seo, J. Nam, K. Lee, S. Kim, and K. Lee, “DiffFace: Diffusion-based face swapping with facial guidance,”arXiv preprint arXiv:2212.13344, 2022

work page arXiv 2022
[3]

Deep- fake generation and detection: A benchmark and survey

G. Pei, J. Zhang, M. Hu, Z. Zhang, C. Wang, Y . Wu, G. Zhai, J. Yang, C. Shen, and D. Tao, “Deepfake generation and detection: A benchmark and survey,”arXiv preprint arXiv:2403.17881, 2024

work page arXiv 2024
[4]

De-fake: Detection and attri- bution of fake images generated by text-to-image generation models,

Z. Sha, Z. Li, N. Yu, and Y . Zhang, “De-fake: Detection and attri- bution of fake images generated by text-to-image generation models,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2023, p. 3418–3432

work page 2023
[5]

Deepfake network architecture attribution,

T. Yang, Z. Huang, J. Cao, L. Li, and X. Li, “Deepfake network architecture attribution,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2022, pp. 4662–4670

work page 2022
[6]

Contrastive pseudo learning for open-world deepfake attribution,

Z. Sun, S. Chen, T. Yao, B. Yin, R. Yi, S. Ding, and L. Ma, “Contrastive pseudo learning for open-world deepfake attribution,” inProceedings of the IEEE International Conference on Computer Vision (ICCV), 2023, pp. 20 825–20 835

work page 2023
[7]

Rethinking open- world deepfake attribution with multi-perspective sensory learning,

Z. Sun, S. Chen, T. Yao, R. Yi, S. Ding, and L. Ma, “Rethinking open- world deepfake attribution with multi-perspective sensory learning,”Int. J. Comput. Vision, vol. 133, no. 2, p. 628–651, Aug. 2024

work page 2024
[8]

Diffusion facial forgery detection,

H. Cheng, Y . Guo, T. Wang, L. Nie, and M. Kankanhalli, “Diffusion facial forgery detection,” inProceedings of the 32nd ACM International Conference on Multimedia (MM), 2024, p. 5939–5948

work page 2024
[9]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inProceedings of the International Conference on Machine Learning (ICML), 2021, pp. 8748–8763

work page 2021
[10]

Genface: A large-scale fine-grained face forgery benchmark and cross appearance-edge learning,

Y . Zhang, Z. Yu, T. Wang, X. Huang, L. Shen, Z. Gao, and J. Ren, “Genface: A large-scale fine-grained face forgery benchmark and cross appearance-edge learning,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 8559–8572, 2024

work page 2024
[11]

Mfclip: Multi-modal fine-grained clip for generalizable diffusion face forgery detection,

Y . Zhang, T. Wang, Z. Yu, Z. Gao, L. Shen, and S. Chen, “Mfclip: Multi-modal fine-grained clip for generalizable diffusion face forgery detection,”IEEE Transactions on Information Forensics and Security, vol. 20, pp. 5888–5903, 2025

work page 2025
[12]

face-parsing,

V . Yakhyokhuja, “face-parsing,” https://github.com/zllrunning/ face-parsing.PyTorch, 2024

work page 2024
[13]

Contrastive-center loss for deep neural networks,

C. Qi and F. Su, “Contrastive-center loss for deep neural networks,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), September 2017, pp. 2851–2855

work page 2017
[14]

Towards discovery and attribution of open-world gan generated images,

S. Girish, S. Suri, S. S. Rambhatla, and A. Shrivastava, “Towards discovery and attribution of open-world gan generated images,” in Proceedings of the IEEE international conference on computer vision (CVPR), 2021, pp. 14 094–14 103

work page 2021
[15]

Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion,

J. Shipard, A. Wiliem, K. N. Thanh, W. Xiang, and C. Fookes, “Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion,” inProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2023, pp. 769–778

work page 2023
[16]

Carzero: Cross-attention alignment for radiology zero-shot classifica- tion,

H. Lai, Q. Yao, Z. Jiang, R. Wang, Z. He, X. Tao, and S. K. Zhou, “Carzero: Cross-attention alignment for radiology zero-shot classifica- tion,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 11 137–11 146

work page 2024
[17]

Flip-80m: 80 million visual-linguistic pairs for facial language-image pre-training,

Y . Li, X. Hou, Z. Dezhi, L. Shen, and Z. Zhao, “Flip-80m: 80 million visual-linguistic pairs for facial language-image pre-training,” in Proceedings of the ACM International Conference on Multimedia (MM), 2024, p. 58–67

work page 2024
[18]

Label propagation for zero-shot clas- sification with vision-language models,

Y . Kalantidis, G. Toliaset al., “Label propagation for zero-shot clas- sification with vision-language models,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 23 209–23 218

work page 2024
[19]

Improved zero-shot classification by adapting vlms with text descriptions,

O. Saha, G. Van Horn, and S. Maji, “Improved zero-shot classification by adapting vlms with text descriptions,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17 542–17 552

work page 2024
[20]

C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection,

C. Tan, R. Tao, H. Liu, G. Gu, B. Wu, Y . Zhao, and Y . Wei, “C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 7184–7192

work page 2025
[21]

Forensics adapter: Adapting clip for generalizable face forgery detection,

X. Cui, Y . Li, A. Luo, J. Zhou, and J. Dong, “Forensics adapter: Adapting clip for generalizable face forgery detection,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 207–19 217

work page 2025
[22]

Design of an image edge detection filter using the Sobel operator,

N. Kanopoulos, N. Vasanthavada, and R. L. Baker, “Design of an image edge detection filter using the Sobel operator,”IEEE Journal of Solid- state Circuits, vol. 23, no. 2, pp. 358–367, 1988

work page 1988
[23]

Rich models for steganalysis of digital images,

J. Fridrich and J. Kodovsky, “Rich models for steganalysis of digital images,”IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012

work page 2012
[24]

Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning,

C. Tan, Y . Zhao, S. Wei, G. Gu, P. Liu, and Y . Wei, “Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 5, 2024, pp. 5052–5060

work page 2024
[25]

Deepfake video detection using convolutional vision transformer,

D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision transformer,” 2021, arXiv preprint arXiv:2102.11126

work page arXiv 2021
[26]

Lampmark: Proactive deepfake detection via training-free landmark perceptual wa- termarks,

T. Wang, M. Huang, H. Cheng, X. Zhang, and Z. Shen, “Lampmark: Proactive deepfake detection via training-free landmark perceptual wa- termarks,” inProceedings of the ACM International Conference on Multimedia (MM), 2024, p. 10515–10524

work page 2024
[27]

Distilled transformers with locally enhanced global representations for face forgery detection,

Y . Zhang, Q. Li, Z. Yu, and L. Shen, “Distilled transformers with locally enhanced global representations for face forgery detection,”Pattern Recognition, vol. 161, p. 111253, 2025

work page 2025
[28]

Towards benchmarking and evaluating deepfake detection,

J. Deng, C. Lin, P. Hu, C. Shen, Q. Wang, Q. Li, and Q. Li, “Towards benchmarking and evaluating deepfake detection,”IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 6, pp. 5112–5127, 2024

work page 2024
[29]

Ada-finfer: Inferring face representations from adaptive select frames for high- visual-quality deepfake detection,

J. Hu, J. Liang, Z. Qin, X. Liao, W. Zhou, and X. Lin, “Ada-finfer: Inferring face representations from adaptive select frames for high- visual-quality deepfake detection,”IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 3, pp. 3011–3027, 2025

work page 2025
[30]

Deepfake detection and localiza- tion using multi-view inconsistency measurement,

B. Zhang, Q. Yin, W. Lu, and X. Luo, “Deepfake detection and localiza- tion using multi-view inconsistency measurement,”IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 2, pp. 1796–1809, 2025

work page 2025
[31]

Df40: Toward next-generation deepfake detection,

Z. Yan, T. Yao, S. Chen, Y . Zhao, X. Fu, J. Zhu, D. Luo, C. Wang, S. Ding, Y . Wu, and L. Yuan, “Df40: Toward next-generation deepfake detection,” inProceedings of the Advances in Neural Information Processing Systems (NIPS), vol. 37, 2024, pp. 29 387–29 434

work page 2024
[32]

Progressive growing of GANs for improved quality, stability, and variation,

T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” inProceedings of the International Conference on Learning Representations (ICLR), 2018, pp. 26–37

work page 2018
[33]

Celeb-df++: A large-scale chal- lenging video deepfake benchmark for generalizable forensics,

Y . Li, D. Zhu, X. Cui, and S. Lyu, “Celeb-df++: A large-scale chal- lenging video deepfake benchmark for generalizable forensics,”arXiv preprint arXiv:2507.18015, 2025

work page arXiv 2025
[34]

A style-based generator architecture for generative adversarial networks,

T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4396– 4405

work page 2019
[35]

Realistic and efficient face swapping: A unified approach with diffusion models,

S. Baliah, Q. Lin, S. Liao, X. Liang, and M. H. Khan, “Realistic and efficient face swapping: A unified approach with diffusion models,” inProceedings of the Winter Conference on Applications of Computer Vision (WACV), February 2025, pp. 1062–1071

work page 2025
[36]

A baseline for detecting misclassified and out-of-distribution examples in neural networks,

D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,”Proceedings of International Conference on Learning Representations, 2017

work page 2017
[37]

Learning Confidence for Out-of-Distribution Detection in Neural Networks

T. DeVries and G. W. Taylor, “Learning confidence for out- of-distribution detection in neural networks,”arXiv preprint arXiv:1802.04865, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[38]

Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” inProceedings of the International Conference on Learning Represen- tations (ICLR), May 2015, pp. 1–15

work page 2015
[39]

Wilddeepfake: A challenging real-world dataset for deepfake detection,

B. Zi, M. Chang, J. Chen, X. Ma, and Y .-G. Jiang, “Wilddeepfake: A challenging real-world dataset for deepfake detection,” inProceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2382–2390

work page 2020
[40]

Celeb-df: A large- scale challenging dataset for deepfake forensics,

Y . Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-df: A large- scale challenging dataset for deepfake forensics,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3204–3213. 15

work page 2020
[41]

The DeepFake Detection Challenge (DFDC) Dataset

B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. Canton-Ferrer, “The deepfake detection challenge dataset,”arXiv preprint arXiv:2006.07397, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[42]

Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection,

L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, “Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 2886–2895

work page 2020
[43]

Visualizing data using t-sne

L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.”Journal of machine learning research, vol. 9, no. 11, 2008

work page 2008
[44]

Segface: Face segmentation of long-tail classes,

K. Narayan, V . VS, and V . M. Patel, “Segface: Face segmentation of long-tail classes,”arXiv preprint arXiv:2412.08647, 2024

work page arXiv 2024
[45]

Grad-cam: Visual explanations from deep networks via gradient-based localization,

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” inProceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017, pp. 618–626. Yaning Zhangreceived the double bachelor’s de- gree in Internet of Things Engineering and En...

work page 2017

[1] [1]

Generative Adversarial Nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative Adversarial Nets,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680

work page 2014

[2] [2]

DiffFace: Diffusion-based face swapping with facial guidance,

K. Kim, Y . Kim, S. Cho, J. Seo, J. Nam, K. Lee, S. Kim, and K. Lee, “DiffFace: Diffusion-based face swapping with facial guidance,”arXiv preprint arXiv:2212.13344, 2022

work page arXiv 2022

[3] [3]

Deep- fake generation and detection: A benchmark and survey

G. Pei, J. Zhang, M. Hu, Z. Zhang, C. Wang, Y . Wu, G. Zhai, J. Yang, C. Shen, and D. Tao, “Deepfake generation and detection: A benchmark and survey,”arXiv preprint arXiv:2403.17881, 2024

work page arXiv 2024

[4] [4]

De-fake: Detection and attri- bution of fake images generated by text-to-image generation models,

Z. Sha, Z. Li, N. Yu, and Y . Zhang, “De-fake: Detection and attri- bution of fake images generated by text-to-image generation models,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2023, p. 3418–3432

work page 2023

[5] [5]

Deepfake network architecture attribution,

T. Yang, Z. Huang, J. Cao, L. Li, and X. Li, “Deepfake network architecture attribution,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2022, pp. 4662–4670

work page 2022

[6] [6]

Contrastive pseudo learning for open-world deepfake attribution,

Z. Sun, S. Chen, T. Yao, B. Yin, R. Yi, S. Ding, and L. Ma, “Contrastive pseudo learning for open-world deepfake attribution,” inProceedings of the IEEE International Conference on Computer Vision (ICCV), 2023, pp. 20 825–20 835

work page 2023

[7] [7]

Rethinking open- world deepfake attribution with multi-perspective sensory learning,

Z. Sun, S. Chen, T. Yao, R. Yi, S. Ding, and L. Ma, “Rethinking open- world deepfake attribution with multi-perspective sensory learning,”Int. J. Comput. Vision, vol. 133, no. 2, p. 628–651, Aug. 2024

work page 2024

[8] [8]

Diffusion facial forgery detection,

H. Cheng, Y . Guo, T. Wang, L. Nie, and M. Kankanhalli, “Diffusion facial forgery detection,” inProceedings of the 32nd ACM International Conference on Multimedia (MM), 2024, p. 5939–5948

work page 2024

[9] [9]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inProceedings of the International Conference on Machine Learning (ICML), 2021, pp. 8748–8763

work page 2021

[10] [10]

Genface: A large-scale fine-grained face forgery benchmark and cross appearance-edge learning,

Y . Zhang, Z. Yu, T. Wang, X. Huang, L. Shen, Z. Gao, and J. Ren, “Genface: A large-scale fine-grained face forgery benchmark and cross appearance-edge learning,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 8559–8572, 2024

work page 2024

[11] [11]

Mfclip: Multi-modal fine-grained clip for generalizable diffusion face forgery detection,

Y . Zhang, T. Wang, Z. Yu, Z. Gao, L. Shen, and S. Chen, “Mfclip: Multi-modal fine-grained clip for generalizable diffusion face forgery detection,”IEEE Transactions on Information Forensics and Security, vol. 20, pp. 5888–5903, 2025

work page 2025

[12] [12]

face-parsing,

V . Yakhyokhuja, “face-parsing,” https://github.com/zllrunning/ face-parsing.PyTorch, 2024

work page 2024

[13] [13]

Contrastive-center loss for deep neural networks,

C. Qi and F. Su, “Contrastive-center loss for deep neural networks,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), September 2017, pp. 2851–2855

work page 2017

[14] [14]

Towards discovery and attribution of open-world gan generated images,

S. Girish, S. Suri, S. S. Rambhatla, and A. Shrivastava, “Towards discovery and attribution of open-world gan generated images,” in Proceedings of the IEEE international conference on computer vision (CVPR), 2021, pp. 14 094–14 103

work page 2021

[15] [15]

Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion,

J. Shipard, A. Wiliem, K. N. Thanh, W. Xiang, and C. Fookes, “Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion,” inProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2023, pp. 769–778

work page 2023

[16] [16]

Carzero: Cross-attention alignment for radiology zero-shot classifica- tion,

H. Lai, Q. Yao, Z. Jiang, R. Wang, Z. He, X. Tao, and S. K. Zhou, “Carzero: Cross-attention alignment for radiology zero-shot classifica- tion,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 11 137–11 146

work page 2024

[17] [17]

Flip-80m: 80 million visual-linguistic pairs for facial language-image pre-training,

Y . Li, X. Hou, Z. Dezhi, L. Shen, and Z. Zhao, “Flip-80m: 80 million visual-linguistic pairs for facial language-image pre-training,” in Proceedings of the ACM International Conference on Multimedia (MM), 2024, p. 58–67

work page 2024

[18] [18]

Label propagation for zero-shot clas- sification with vision-language models,

Y . Kalantidis, G. Toliaset al., “Label propagation for zero-shot clas- sification with vision-language models,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 23 209–23 218

work page 2024

[19] [19]

Improved zero-shot classification by adapting vlms with text descriptions,

O. Saha, G. Van Horn, and S. Maji, “Improved zero-shot classification by adapting vlms with text descriptions,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17 542–17 552

work page 2024

[20] [20]

C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection,

C. Tan, R. Tao, H. Liu, G. Gu, B. Wu, Y . Zhao, and Y . Wei, “C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 7184–7192

work page 2025

[21] [21]

Forensics adapter: Adapting clip for generalizable face forgery detection,

X. Cui, Y . Li, A. Luo, J. Zhou, and J. Dong, “Forensics adapter: Adapting clip for generalizable face forgery detection,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 207–19 217

work page 2025

[22] [22]

Design of an image edge detection filter using the Sobel operator,

N. Kanopoulos, N. Vasanthavada, and R. L. Baker, “Design of an image edge detection filter using the Sobel operator,”IEEE Journal of Solid- state Circuits, vol. 23, no. 2, pp. 358–367, 1988

work page 1988

[23] [23]

Rich models for steganalysis of digital images,

J. Fridrich and J. Kodovsky, “Rich models for steganalysis of digital images,”IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012

work page 2012

[24] [24]

Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning,

C. Tan, Y . Zhao, S. Wei, G. Gu, P. Liu, and Y . Wei, “Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 5, 2024, pp. 5052–5060

work page 2024

[25] [25]

Deepfake video detection using convolutional vision transformer,

D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision transformer,” 2021, arXiv preprint arXiv:2102.11126

work page arXiv 2021

[26] [26]

Lampmark: Proactive deepfake detection via training-free landmark perceptual wa- termarks,

T. Wang, M. Huang, H. Cheng, X. Zhang, and Z. Shen, “Lampmark: Proactive deepfake detection via training-free landmark perceptual wa- termarks,” inProceedings of the ACM International Conference on Multimedia (MM), 2024, p. 10515–10524

work page 2024

[27] [27]

Distilled transformers with locally enhanced global representations for face forgery detection,

Y . Zhang, Q. Li, Z. Yu, and L. Shen, “Distilled transformers with locally enhanced global representations for face forgery detection,”Pattern Recognition, vol. 161, p. 111253, 2025

work page 2025

[28] [28]

Towards benchmarking and evaluating deepfake detection,

J. Deng, C. Lin, P. Hu, C. Shen, Q. Wang, Q. Li, and Q. Li, “Towards benchmarking and evaluating deepfake detection,”IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 6, pp. 5112–5127, 2024

work page 2024

[29] [29]

Ada-finfer: Inferring face representations from adaptive select frames for high- visual-quality deepfake detection,

J. Hu, J. Liang, Z. Qin, X. Liao, W. Zhou, and X. Lin, “Ada-finfer: Inferring face representations from adaptive select frames for high- visual-quality deepfake detection,”IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 3, pp. 3011–3027, 2025

work page 2025

[30] [30]

Deepfake detection and localiza- tion using multi-view inconsistency measurement,

B. Zhang, Q. Yin, W. Lu, and X. Luo, “Deepfake detection and localiza- tion using multi-view inconsistency measurement,”IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 2, pp. 1796–1809, 2025

work page 2025

[31] [31]

Df40: Toward next-generation deepfake detection,

Z. Yan, T. Yao, S. Chen, Y . Zhao, X. Fu, J. Zhu, D. Luo, C. Wang, S. Ding, Y . Wu, and L. Yuan, “Df40: Toward next-generation deepfake detection,” inProceedings of the Advances in Neural Information Processing Systems (NIPS), vol. 37, 2024, pp. 29 387–29 434

work page 2024

[32] [32]

Progressive growing of GANs for improved quality, stability, and variation,

T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” inProceedings of the International Conference on Learning Representations (ICLR), 2018, pp. 26–37

work page 2018

[33] [33]

Celeb-df++: A large-scale chal- lenging video deepfake benchmark for generalizable forensics,

Y . Li, D. Zhu, X. Cui, and S. Lyu, “Celeb-df++: A large-scale chal- lenging video deepfake benchmark for generalizable forensics,”arXiv preprint arXiv:2507.18015, 2025

work page arXiv 2025

[34] [34]

A style-based generator architecture for generative adversarial networks,

T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4396– 4405

work page 2019

[35] [35]

Realistic and efficient face swapping: A unified approach with diffusion models,

S. Baliah, Q. Lin, S. Liao, X. Liang, and M. H. Khan, “Realistic and efficient face swapping: A unified approach with diffusion models,” inProceedings of the Winter Conference on Applications of Computer Vision (WACV), February 2025, pp. 1062–1071

work page 2025

[36] [36]

A baseline for detecting misclassified and out-of-distribution examples in neural networks,

D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,”Proceedings of International Conference on Learning Representations, 2017

work page 2017

[37] [37]

Learning Confidence for Out-of-Distribution Detection in Neural Networks

T. DeVries and G. W. Taylor, “Learning confidence for out- of-distribution detection in neural networks,”arXiv preprint arXiv:1802.04865, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[38] [38]

Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” inProceedings of the International Conference on Learning Represen- tations (ICLR), May 2015, pp. 1–15

work page 2015

[39] [39]

Wilddeepfake: A challenging real-world dataset for deepfake detection,

B. Zi, M. Chang, J. Chen, X. Ma, and Y .-G. Jiang, “Wilddeepfake: A challenging real-world dataset for deepfake detection,” inProceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2382–2390

work page 2020

[40] [40]

Celeb-df: A large- scale challenging dataset for deepfake forensics,

Y . Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-df: A large- scale challenging dataset for deepfake forensics,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3204–3213. 15

work page 2020

[41] [41]

The DeepFake Detection Challenge (DFDC) Dataset

B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. Canton-Ferrer, “The deepfake detection challenge dataset,”arXiv preprint arXiv:2006.07397, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[42] [42]

Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection,

L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, “Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 2886–2895

work page 2020

[43] [43]

Visualizing data using t-sne

L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.”Journal of machine learning research, vol. 9, no. 11, 2008

work page 2008

[44] [44]

Segface: Face segmentation of long-tail classes,

K. Narayan, V . VS, and V . M. Patel, “Segface: Face segmentation of long-tail classes,”arXiv preprint arXiv:2412.08647, 2024

work page arXiv 2024

[45] [45]

Grad-cam: Visual explanations from deep networks via gradient-based localization,

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” inProceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017, pp. 618–626. Yaning Zhangreceived the double bachelor’s de- gree in Internet of Things Engineering and En...

work page 2017