Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline

Jingchun Lian; Lianwei Wu; Lingyu Liu; Li Zhu; Yaxiong Wang; Yujiao Wu; Zhedong Zheng

arxiv: 2412.19685 · v3 · submitted 2024-12-27 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-23 06:31 UTC · model grok-4.3

keywords: forgery attribution · report generation · facial manipulation · multimodal forensics · forgery localization · MMTT dataset · ForgeryTalker · explainable detection

The pith

A new task and model generate reports that locate forged facial regions and explain the editing process in natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes Forgery Attribution Report Generation as a multimodal task that requires both localizing manipulated areas in facial images and producing grounded textual explanations of the edits. Existing detection methods stop at binary labels or pixel masks and offer no semantic account of the manipulation, so the authors supply a large dataset and a baseline system to make joint localization-plus-explanation feasible. The MMTT dataset contains 152,217 samples, each paired with a process-derived mask and a human-written description of the editing steps. ForgeryTalker uses a shared vision-language encoder plus two separate decoders to output both the mask and the report in one forward pass.

Core claim

The paper claims that a single end-to-end network, ForgeryTalker, can jointly solve forgery localization and natural-language report generation on the MMTT dataset, reaching 59.3 CIDEr on text generation and 73.67 IoU on mask prediction and thereby supplying the first public baseline for explainable multimedia forensics.
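CIDEr compares a generated report with reference reports through TF-IDF-weighted n-gram vectors. The following is a minimal sketch of that idea, assuming whitespace tokenisation and leaving out refinements (such as stemming, length penalties, or constant rescaling) that published variants and evaluation toolkits may apply, so its values are not directly comparable to the 59.3 quoted above. The function names `ngram_counts`, `cosine`, and `simple_cider` are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def ngram_counts(sentence, n):
    """Count the n-grams of a lower-cased, whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def simple_cider(candidates, references, max_n=4):
    """Return one simplified CIDEr score per image.

    candidates: one generated report (str) per image.
    references: one non-empty list of reference reports per image.
    """
    num_images = len(candidates)
    scores = [0.0] * num_images
    for n in range(1, max_n + 1):
        # Document frequency: number of images whose references contain the n-gram.
        doc_freq = Counter()
        for refs in references:
            seen = set()
            for ref in refs:
                seen.update(ngram_counts(ref, n))
            doc_freq.update(seen)

        def tfidf(sentence):
            counts = ngram_counts(sentence, n)
            return {
                gram: count * math.log(num_images / max(1, doc_freq[gram]))
                for gram, count in counts.items()
            }

        for i, (cand, refs) in enumerate(zip(candidates, references)):
            cand_vec = tfidf(cand)
            sims = [cosine(cand_vec, tfidf(ref)) for ref in refs]
            # Average over references, then weight each n-gram order equally.
            scores[i] += sum(sims) / len(sims) / max_n
    return scores


# Toy reports, invented for illustration only.
references = [["the left eye was replaced"], ["the mouth region was smoothed"]]
matching = simple_cider(["the left eye was replaced", "the mouth region was smoothed"], references)
mismatched = simple_cider(["the nose was enlarged", "the mouth region was smoothed"], references)
```

In this toy corpus a report identical to its reference scores close to 1, while a report that shares only corpus-wide words such as "the" and "was" scores 0, because n-grams that occur for every image receive zero IDF weight.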

What carries the argument

ForgeryTalker, a unified architecture with an image encoder plus Q-former shared across a mask decoder and a text decoder that enables cross-modal reasoning between visual edits and linguistic descriptions.
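The structural idea, one shared set of features feeding two output heads, can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `DualDecoderModel`, `toy_encoder`, `toy_mask_decoder`, and `toy_text_decoder` are invented names, and the hand-written rules replace what would be learned networks (an image encoder with a Q-former, a mask decoder, and a language decoder) in the actual system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]      # toy grayscale image, values in [0, 1]
Features = List[List[float]]   # toy feature map with the same layout
Mask = List[List[int]]


@dataclass
class DualDecoderModel:
    """Shared encoder feeding a mask decoder and a text decoder."""
    encoder: Callable[[Image], Features]
    mask_decoder: Callable[[Features], Mask]
    text_decoder: Callable[[Features], str]

    def forward(self, image: Image) -> Tuple[Mask, str]:
        features = self.encoder(image)  # computed once, reused by both heads
        return self.mask_decoder(features), self.text_decoder(features)


# Hypothetical stand-ins so the sketch runs; real components would be learned.
def toy_encoder(image: Image) -> Features:
    return [[abs(pixel - 0.5) for pixel in row] for row in image]


def toy_mask_decoder(features: Features) -> Mask:
    return [[1 if value > 0.25 else 0 for value in row] for row in features]


def toy_text_decoder(features: Features) -> str:
    flagged = sum(value > 0.25 for row in features for value in row)
    return f"{flagged} pixel(s) look edited"


model = DualDecoderModel(toy_encoder, toy_mask_decoder, toy_text_decoder)
mask, report = model.forward([[0.5, 0.9], [0.5, 0.1]])
```

Because both heads read the same `features`, the mask and the report in this toy example cannot disagree about which pixels were flagged, which is the kind of coherence the shared-encoder design is meant to encourage.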

If this is right

Forensic systems can now output both a visual map and a readable account of what was changed instead of a single yes/no score.
The dual-decoder design shows that mask and text outputs can be produced coherently from the same visual features.
The MMTT dataset supplies training pairs that link concrete editing operations to both spatial and linguistic ground truth.
Performance numbers of 59.3 CIDEr and 73.67 IoU set a measurable target for subsequent models that attempt the same joint task.
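For readers unfamiliar with the localization metric, intersection over union (IoU) can be sketched in a few lines. This is a minimal illustration that assumes masks are equally sized nested lists of 0/1 values; the function name `mask_iou` and the empty-mask convention are assumptions made here, not details from the paper. The function returns a fraction in [0, 1], so a figure such as 73.67 presumably sits on a percentage scale.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if union == 0 else intersection / union


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels / 4 pixels in either mask = 0.5
```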

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same report-generation approach could be applied to video or audio manipulations if analogous process-derived annotations can be created.
If the generated reports prove reliable, they could serve as machine-generated evidence logs in legal or journalistic workflows.
Because the masks are derived from the editing process itself rather than drawn by hand, similar automatic annotation pipelines might be built for other image-editing domains without manual mask labeling.

Load-bearing premise

Process-derived masks and human-written descriptions accurately and completely capture the actual editing operations performed on each image.

What would settle it

Collect a set of facial images edited with operations absent from the MMTT training distribution, run ForgeryTalker on them, and have independent human raters compare the generated masks and reports against the true editing steps; systematic mismatch in either localization or explanation would falsify the baseline claim.
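The localization half of such a test could be summarized by grouping per-image IoU by editing operation and flagging operations that fall well below a reference level. The sketch below assumes per-image IoU values are already available; the helper names, the operation labels, the tolerance, and all numbers are made up for illustration and are not results from the paper. The report half would still need human raters, as described above.

```python
from collections import defaultdict
from statistics import mean


def iou_by_operation(records):
    """Average IoU per editing operation from (operation, iou) pairs."""
    grouped = defaultdict(list)
    for operation, iou in records:
        grouped[operation].append(iou)
    return {operation: mean(values) for operation, values in grouped.items()}


def flag_weak_operations(per_operation, reference_iou, tolerance=0.10):
    """Operations whose mean IoU is more than `tolerance` below a reference value."""
    return sorted(op for op, score in per_operation.items()
                  if score < reference_iou - tolerance)


# Made-up numbers purely for illustration; they are not results from the paper.
records = [("relighting", 0.42), ("relighting", 0.38),
           ("face_swap", 0.71), ("face_swap", 0.75)]
per_operation = iou_by_operation(records)
weak = flag_weak_operations(per_operation, reference_iou=0.70)
```

With these invented values only `relighting` is flagged, which is the pattern of systematic mismatch on unseen operations that the proposed test is looking for.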

Figures

Figures reproduced from arXiv: 2412.19685 by Jingchun Lian, Lianwei Wu, Lingyu Liu, Li Zhu, Yaxiong Wang, Yujiao Wu, Zhedong Zheng.

**Figure 1.** The proposed framework combines forgery localization and interpretive analysis. The left panel illustrates dataset construction …

**Figure 2.** Annotation pipeline for forgery interpretation. Annotators review the original and forged images ( …

**Figure 3.** Overview of the MMTT dataset statistics. GAN …

**Figure 4.** Illustration of our ForgeryTalker. ForgeryTalker extends the InstructBlip framework by incorporating a Forgery Prompter Net …

Original abstract

Existing facial forgery detection methods typically focus on binary classification or pixel-level localization, providing little semantic insight into the nature of the manipulation. To address this, we introduce Forgery Attribution Report Generation, a new multimodal task that jointly localizes forged regions ("Where") and generates natural language explanations grounded in the editing process ("Why"). This dual-focus approach goes beyond traditional forensics, providing a comprehensive understanding of the manipulation. To enable research in this domain, we present Multi-Modal Tamper Tracing (MMTT), a large-scale dataset of 152,217 samples, each with a process-derived ground-truth mask and a human-authored textual description, ensuring high annotation precision and linguistic richness. We further propose ForgeryTalker, a unified end-to-end framework that integrates vision and language via a shared encoder (image encoder + Q-former) and dual decoders for mask and text generation, enabling coherent cross-modal reasoning. Experiments show that ForgeryTalker achieves competitive performance on both report generation and forgery localization subtasks, i.e., 59.3 CIDEr and 73.67 IoU, respectively, establishing a baseline for explainable multimedia forensics. Dataset and code will be released to foster future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New task and dataset for semantic forgery reports, but the baseline rests on unvalidated process-derived masks and human texts.

The paper's main contribution is the Forgery Attribution Report Generation task, which pairs region localization with natural-language explanations of the editing steps. They built the MMTT dataset of 152k samples using process-derived masks plus human-authored descriptions, and they present ForgeryTalker, a shared-encoder model with separate mask and text decoders. This moves the field past binary detection or pure pixel masks toward something more interpretable, and the decision to release the data and code is useful for follow-up work.

Referee Report

2 major / 2 minor

Summary. The paper introduces Forgery Attribution Report Generation, a multimodal task for localizing forged facial regions and generating natural-language explanations of the editing process. It presents the MMTT dataset (152,217 samples with process-derived masks and human-authored texts) and the ForgeryTalker model (shared encoder + dual decoders), reporting baseline results of 59.3 CIDEr on report generation and 73.67 IoU on localization.

Significance. If the ground-truth annotations prove reliable, the work supplies the first large-scale benchmark and baseline for explainable multimedia forensics, moving beyond binary detection or pixel localization. The planned public release of the dataset and code is a concrete strength that would enable reproducible follow-up research.

major comments (2)

[Abstract, §3 (Dataset)] The claim of 'high annotation precision' for the 152k process-derived masks and human-authored texts is unsupported by any quantitative validation (inter-annotator agreement, consistency checks against source editing pipelines, or coverage statistics across manipulation types). Because both the 59.3 CIDEr and 73.67 IoU scores are measured against these targets, the absence of such validation directly undermines interpretability of the reported baseline performance.
[§5 (Experiments)] Headline metrics are presented without baseline comparisons, statistical significance tests, or details on train/validation/test splits, making it impossible to assess whether ForgeryTalker constitutes a meaningful advance over prior unimodal forgery methods.

minor comments (2)

[§4] Notation for the dual-decoder architecture in §4 is introduced without an accompanying diagram or explicit equation for the joint loss, complicating reproduction.
[Table 1] Table 1 (dataset statistics) lists sample counts but omits breakdown by manipulation type or source dataset, which would help readers evaluate coverage.
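To make the first minor comment concrete, a joint objective for a dual-decoder model is often written as a weighted sum of a text loss and a mask loss. The form below is a generic sketch under that assumption, not the paper's equation, which is not given on this page. The weight $\lambda$, the symbols $w_t$ (report tokens), $I$ (input image), and $\theta$ (model parameters), and the choice of a token-level negative log-likelihood for the text term are all illustrative; $\mathcal{L}_{\text{mask}}$ stands for any pixel-wise loss between predicted and ground-truth masks.

```latex
\mathcal{L}_{\text{joint}}
  = \mathcal{L}_{\text{text}} + \lambda \, \mathcal{L}_{\text{mask}},
\qquad
\mathcal{L}_{\text{text}}
  = -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_{<t},\, I\right)
```

Stating the objective at this level of detail, together with the value of any weighting term, is what a reader would need in order to reproduce the training setup.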

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our work. We provide point-by-point responses to the major comments below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract, §3 (Dataset)] The claim of 'high annotation precision' for the 152k process-derived masks and human-authored texts is unsupported by any quantitative validation (inter-annotator agreement, consistency checks against source editing pipelines, or coverage statistics across manipulation types). Because both the 59.3 CIDEr and 73.67 IoU scores are measured against these targets, the absence of such validation directly undermines interpretability of the reported baseline performance.

Authors: The masks in MMTT are process-derived, meaning they are generated automatically from the known editing operations applied to create each sample. This provides exact ground truth without manual annotation variability, so inter-annotator agreement is not applicable. The texts are human-authored following detailed guidelines based on the editing process. While we did not report quantitative validation metrics in the initial submission, we will update §3 to include a description of the annotation process, any consistency checks performed, and statistics on coverage across manipulation types to better substantiate the precision claim.

revision: yes

Referee: [§5 (Experiments)] Headline metrics are presented without baseline comparisons, statistical significance tests, or details on train/validation/test splits, making it impossible to assess whether ForgeryTalker constitutes a meaningful advance over prior unimodal forgery methods.

Authors: We acknowledge that the experimental results would be more informative with these additions. In the revised version of the paper, we will expand the Experiments section (§5) to include comparisons with relevant baseline methods from the forgery detection literature (adapted to the multimodal task where possible), report p-values or confidence intervals for statistical significance, and provide full details on the train/validation/test splits used in our evaluations.

revision: yes
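One common way to attach a confidence interval to a headline metric is a percentile bootstrap over per-sample scores. The sketch below assumes per-image IoU values are available as a list; the function name `bootstrap_ci`, the resample count, the seed, and the example values are illustrative choices and not taken from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(num_resamples)
    )
    lower = resampled_means[int((alpha / 2) * num_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper


# Made-up per-image IoU values for illustration only.
ious = [0.81, 0.64, 0.77, 0.70, 0.92, 0.58, 0.74, 0.69]
low, high = bootstrap_ci(ious)
```

Reporting the interval alongside the mean would let readers judge whether a gap between two models exceeds sampling noise; a paired variant that resamples the same images for both models would be the natural choice for direct comparisons.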

Circularity Check

0 steps flagged

No circularity: empirical results on newly introduced dataset

The paper introduces the MMTT dataset (152k samples with process-derived masks and human-authored texts) and evaluates the ForgeryTalker model via standard metrics (CIDEr, IoU) on held-out data. These are direct empirical measurements against external ground truth, not quantities derived from the model's own fitted parameters or equations. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The derivation chain consists of dataset construction followed by supervised training and evaluation, which is self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the fidelity of the newly created dataset annotations and the assumption that joint mask-text training yields coherent cross-modal reasoning; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: process-derived ground-truth masks and human-authored descriptions faithfully capture the editing operations.
These annotations serve as supervision for both localization and text generation.


[1] [1]

Generative adversarial network applications in industry 4.0: A review

Chafic Abou Akar, Rachelle Abdel Massih, Anthony Yaghi, Joe Khalil, Marc Kamradt, and Abdallah Makhoul. Generative adversarial network applications in industry 4.0: A review. International Journal of Computer Vision, 132(6):2195–2254, 2024. 3

work page 2024

[2] [3]

Dif- fusionface: Towards a comprehensive dataset for diffusion-based face forgery analysis

Zhongxi Chen, Ke Sun, Ziyin Zhou, Xianming Lin, Xiaoshuai Sun, Liujuan Cao, and Rongrong Ji. Dif- fusionface: Towards a comprehensive dataset for diffusion-based face forgery analysis. arXiv preprint arXiv:2403.18471, 2024. 4

work page arXiv 2024

[3] [4]

Instruct- blip: Towards general-purpose vision-language mod- els with instruction tuning, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instruct- blip: Towards general-purpose vision-language mod- els with instruction tuning, 2023. 5, 7, 8

work page 2023

[4] [5]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems , 34:8780– 8794, 2021. 2

work page 2021

[5] [6]

The DeepFake Detection Challenge (DFDC) Dataset

Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Can- ton Ferrer. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397, 2020. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2006

[6] [7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 5

work page internal anchor Pith review Pith/arXiv arXiv 2010

[7] [8]

Forgerynet: A versatile benchmark for comprehensive forgery analysis

Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guo- jun Yin, Luchuan Song, Lu Sheng, Jing Shao, and Ziwei Liu. Forgerynet: A versatile benchmark for comprehensive forgery analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4360–4369, 2021. 4

work page 2021

[8] [9]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural in- formation processing systems, 33:6840–6851, 2020. 2

work page 2020

[9] [10]

Segment and caption anything

Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, and Zicheng Liu. Segment and caption anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13405–13417, 2024. 7, 8

work page 2024

[10] [11]

Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection

Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Pro- ceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 2889–2898, 2020. 3, 4

work page 2020

[11] [12]

Hcit: Deepfake video detection using a hybrid model of cnn features and vision transformer

Bachir Kaddar, Sid Ahmed Fezza, Wassim Hami- douche, Zahid Akhtar, and Abdenour Hadid. Hcit: Deepfake video detection using a hybrid model of cnn features and vision transformer. In 2021 Inter- national Conference on Visual Communications and Image Processing (VCIP), pages 1–5. IEEE, 2021. 3

work page 2021

[12] [13]

A style- based generator architecture for generative adversar- ial networks

Tero Karras, Samuli Laine, and Timo Aila. A style- based generator architecture for generative adversar- ial networks. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 4401–4410, 2019. 3

work page 2019

[13] [14]

Ex- ploiting spatiotemporal inconsistencies to detect deep- fake videos in the wild

Atharva Khedkar, Atharva Peshkar, Ashlesha Nag- dive, Mahendra Gaikwad, and Sudeep Baudha. Ex- ploiting spatiotemporal inconsistencies to detect deep- fake videos in the wild. In 2022 10th Interna- tional Conference on Emerging Trends in Engineer- ing and Technology-Signal and Information Process- ing (ICETET-SIP-22), pages 1–6. IEEE, 2022. 3

work page 2022

[14] [15]

Dlib-ml: A machine learning toolkit

Davis E King. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10:1755– 1758, 2009. 4

work page 2009

[15] [16]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 4015–4026, 2023. 3, 7

work page 2023

[16] [17]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024. 7, 8

work page 2024

[17] [18]

Deepfake detection through key video frame extraction using gan

S Lalitha and Kavitha Sooda. Deepfake detection through key video frame extraction using gan. In2022 International Conference on Automation, Computing and Renewable Systems (ICACRS) , pages 859–863. IEEE, 2022. 3

work page 2022

[18] [19]

Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild

Trung-Nghia Le, Huy H Nguyen, Junichi Yamag- ishi, and Isao Echizen. Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild. In Proceedings of the IEEE/CVF international conference on computer vi- sion, pages 10117–10127, 2021. 4

work page 2021

[19] [20]

Faceshifter: Towards high fidelity and occlusion aware face swapping

Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019. 4

work page arXiv 1912

[20] [21]

Mat: Mask-aware transformer for large hole image inpainting

Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, and Jiaya Jia. Mat: Mask-aware transformer for large hole image inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10758–10768, 2022. 3

work page 2022

[21] [22]

Celeb-df: A large-scale challenging dataset for deepfake forensics

Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207–3216, 2020. 3, 4

work page 2020

[22] [23]

Maskgan: A facial fusion algorithm for deepfake image detection

Dazhuang Liu, Zhen Yang, Ru Zhang, and Jianyi Liu. Maskgan: A facial fusion algorithm for deepfake image detection. In 2022 International Conference on Computers and Artificial Intelligence Technologies (CAIT), pages 71–78. IEEE, 2022. 3

work page 2022

[23] [24]

Deepfacelab: Integrated, flexible and extensible face-swapping framework

Kunlin Liu, Ivan Perov, Daiheng Gao, Nikolay Chervoniy, Wenbo Zhou, and Weiming Zhang. Deepfacelab: Integrated, flexible and extensible face-swapping framework. Pattern Recognition, 141:109628, 2023. 2

work page 2023

[24] [25]

An intriguing failing of convolutional neural networks and the coordconv solution

Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. Advances in neural information processing systems, 31, 2018. 6

work page 2018

[25] [26]

Ganprintr: Improved fakes and evaluation of the state of the art in face manipulation detection

Joao C Neves, Ruben Tolosana, Ruben Vera-Rodriguez, Vasco Lopes, Hugo Proença, and Julian Fierrez. Ganprintr: Improved fakes and evaluation of the state of the art in face manipulation detection. IEEE Journal of Selected Topics in Signal Processing, 14(5):1038–1048, 2020. 3

work page 2020

[26] [27]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [28]

An experimental evaluation on deepfake detection using deep face recognition

Sreeraj Ramachandran, Aakash Varma Nadimpalli, and Ajita Rattani. An experimental evaluation on deepfake detection using deep face recognition. In 2021 International Carnahan Conference on Security Technology (ICCST), pages 1–6. IEEE, 2021. 3

work page 2021

[28] [29]

Deepfake detection: A systematic literature review

Md Shohel Rana, Mohammad Nur Nobi, Beddhu Murali, and Andrew H Sung. Deepfake detection: A systematic literature review. IEEE access, 10:25494–25513, 2022. 2

work page 2022

[29] [30]

Deep fake face detection using convolutional neural networks

Mj Alben Richards, E Kaaviya Varshini, N Diviya, P Prakash, P Kasthuri, and A Sasithradevi. Deep fake face detection using convolutional neural networks. In 2023 12th International Conference on Advanced Computing (ICoAC), pages 1–5. IEEE, 2023. 3

work page 2023

[30] [31]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2980–2988, 2017. 3

work page 2017

[31] [32]

Faceforensics++: Learning to detect manipulated facial images

Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11,

work page

[32] [33]

Recurrent convolutional strategies for face manipulation detection in videos

Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI), 3(1):80–87, 2019. 3

work page 2019

[33] [34]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2010

[34] [35]

Face forgery detection based on facial region displacement trajectory series

YuYang Sun, ZhiYong Zhang, Isao Echizen, Huy H Nguyen, ChangZhen Qiu, and Lu Sun. Face forgery detection based on facial region displacement trajectory series. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 633–642, 2023. 3

work page 2023

[35] [36]

Media forensics and deepfakes: an overview

Luisa Verdoliva. Media forensics and deepfakes: an overview. IEEE journal of selected topics in signal processing, 14(5):910–932, 2020. 2

work page 2020

[36] [37]

Learning domain-invariant representation for generalizing face forgery detection

Yuanlu Wu, Yan Wo, Caiyu Li, and Guoqiang Han. Learning domain-invariant representation for generalizing face forgery detection. Computers & Security, 130:103280, 2023. 2

work page 2023

[37] [38]

Df40: Toward next-generation deepfake detection,

Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, and Li Yuan. Df40: Toward next-generation deepfake detection,

work page

[38] [39]

A survey on deepfake video detection

Peipeng Yu, Zhihua Xia, Jianwei Fei, and Yujiang Lu. A survey on deepfake video detection. Iet Biometrics, 10(6):607–624, 2021. 2

work page 2021

[39] [40]

Osprey: Pixel understanding with visual instruction tuning

Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28202–28211, 2024. 7, 8

work page 2024

[40] [41]

Genface: A large-scale fine-grained face forgery benchmark and cross appearance-edge learning

Yaning Zhang, Zitong Yu, Tianyi Wang, Xiaobin Huang, Linlin Shen, Zan Gao, and Jianfeng Ren. Genface: A large-scale fine-grained face forgery benchmark and cross appearance-edge learning. IEEE Transactions on Information Forensics and Security,

work page

[41] [42]

Celebv-hq: A large-scale video facial attributes dataset

Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. Celebv-hq: A large-scale video facial attributes dataset. In European conference on computer vision, pages 650–667. Springer, 2022. 3

work page 2022