Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks

Mohammadreza Rashidi; Raja Hashim Ali; Sami Ur Rahman

arxiv: 2605.17573 · v1 · pith:OV5MD73Dnew · submitted 2026-05-17 · 💻 cs.CV · cs.CR

Pith reviewed 2026-05-20 13:04 UTC · model grok-4.3

classification 💻 cs.CV cs.CR

keywords deepfake detection · temporal artifacts · 3D convolutional neural networks · social media · video forgery · cross-dataset transfer · R3D-18 · temporal consistency

The pith

Temporal artifacts in deepfake videos give 3D CNN detectors a signal that survives social-media re-encoding better than spatial features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a 3D convolutional network can exploit inconsistencies across video frames to detect deepfakes more reliably than frame-by-frame spatial methods, especially once videos have been compressed and re-encoded for social platforms. A sympathetic reader would care because current detectors lose accuracy as generator quality improves and as content moves through real-world distribution pipelines. The authors train an R3D-18 backbone on 16-frame clips from DeepfakeTIMIT, starting from Kinetics-400 weights and adding a temporal-consistency term to the loss. They record 92.8 percent accuracy on the source dataset at 128 by 128 resolution and 76.4 percent on FaceForensics++ with no fine-tuning, with ablations isolating the contribution of transfer learning, face tracking, and the temporal regularizer.

Core claim

The authors show that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding. A 3D CNN based on R3D-18 processes 16-frame clips from DeepfakeTIMIT, initialized from action-recognition weights and trained with a composite loss of binary cross-entropy plus a temporal-consistency regularizer. The model reaches 92.8 percent accuracy on intra-dataset tests at 128 by 128 resolution and 76.4 percent on cross-dataset transfer to FaceForensics++ without further training; ablations attribute 7.2 points to transfer learning, 3.5 points to face tracking, and further gains on high-quality fakes to the temporal term.

What carries the argument

A 3D CNN based on R3D-18 that processes 16-frame clips, trained with a temporal-consistency regularizer added to the loss so that inconsistencies across frames are captured.
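
The composite objective can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the regularizer is the mean squared difference between consecutive per-frame feature vectors (the form quoted from the paper elsewhere on this page), applies it to every clip regardless of label, and uses a weight `lam=0.5`, a value that comes from the simulated rebuttal rather than from the abstract. The function names are hypothetical, and plain Python lists stand in for R3D-18 feature maps.

```python
import math

def binary_cross_entropy(prob_fake, label):
    """BCE for one clip; label is 1 for fake and 0 for real."""
    eps = 1e-7  # clamp the probability so log() never sees 0
    p = min(max(prob_fake, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def temporal_consistency(features):
    """Mean squared L2 distance between consecutive per-frame feature vectors."""
    num_frames = len(features)
    if num_frames < 2:
        return 0.0  # a single frame has no temporal neighbour
    total = 0.0
    for prev, nxt in zip(features, features[1:]):
        total += sum((b - a) ** 2 for a, b in zip(prev, nxt))
    return total / (num_frames - 1)

def composite_loss(prob_fake, label, features, lam=0.5):
    """BCE plus lam times the temporal term (lam=0.5 is an assumed weight)."""
    return binary_cross_entropy(prob_fake, label) + lam * temporal_consistency(features)
```

In a real training loop the features would come from an intermediate layer of the network and the whole expression would be differentiated by the framework; the sketch only fixes the arithmetic.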

If this is right

Temporal artifacts provide a more robust detection signal than spatial features alone when videos undergo social-media re-encoding.
The 3D CNN approach achieves meaningful cross-dataset transfer without fine-tuning.
Transfer learning from action-recognition weights improves accuracy by 7.2 percentage points.
Face tracking adds 3.5 percentage points to overall performance.
Temporal consistency regularization yields extra gains specifically on high-quality generated fakes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Moderation systems on social platforms could adopt short-clip temporal analysis to flag content before or during upload.
The same temporal focus might be tested on live video streams or on deepfakes from generators released after DeepfakeTIMIT.
Hybrid detectors that combine this temporal cue with newer spatial methods could be evaluated for resistance to future generator improvements.

Load-bearing premise

The temporal inconsistencies present in the DeepfakeTIMIT training clips remain detectable and representative after the re-encoding and compression steps typical of real social-media distribution.

What would settle it

Measure whether detection accuracy falls sharply when the trained model is evaluated on deepfakes produced by generators that suppress temporal inconsistencies or on videos that have passed through multiple rounds of platform re-encoding.
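
One way to set up the re-encoding half of that test is to push each evaluation clip through repeated H.264 encodes at decreasing bitrates and score the detector on every output. The helper below only builds `ffmpeg` argument lists. The bitrate ladder, the number of rounds, and the output naming are illustrative assumptions, not settings taken from the paper or from any platform.

```python
import os

def reencode_commands(src, bitrates_kbps=(1000, 500, 250), rounds=2):
    """Build ffmpeg argument lists; each round re-encodes the previous round's output."""
    stem, _ext = os.path.splitext(src)
    commands = []
    for kbps in bitrates_kbps:
        current = src  # every bitrate chain starts again from the original clip
        for r in range(1, rounds + 1):
            out = f"{stem}_h264_{kbps}k_r{r}.mp4"  # hypothetical naming scheme
            commands.append([
                "ffmpeg", "-y",        # overwrite outputs without prompting
                "-i", current,         # input for this round
                "-c:v", "libx264",     # H.264 video encoder
                "-b:v", f"{kbps}k",    # target video bitrate
                "-an",                 # drop audio; the detector only sees frames
                out,
            ])
            current = out
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed with libx264 support. Accuracy plotted against bitrate and round count would then be a measured robustness curve rather than an extrapolation.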

Figures

Figures reproduced from arXiv: 2605.17573 by Mohammadreza Rashidi, Raja Hashim Ali, Sami Ur Rahman.

**Figure 1.** Complete workflow pipeline for temporal deepfake detection showing video preprocessing, face detection and tracking, temporal sequence extraction, 3D …

**Figure 2.** Detailed workflow showing user input and processing pipeline for inference from the model. The inference pipeline consists of video preprocessing, …

**Figure 3.** DeepfakeTIMIT dataset organization showing the distribution of real …

**Figure 4.** Sample frames from DeepfakeTIMIT dataset showing authentic (top …

**Figure 7.** Layer-wise structure of the R3D-18 architecture used for deepfake detection, showing 3D residual blocks, temporal pooling, and the final classification …

**Figure 8.** Detailed ablation study results showing the contribution of different …

**Figure 10.** Temporal artifact visualization showing specific frame sequences …

Original abstract

Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.
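
The 16-frame clip input can be pictured as a sliding window over the frame indices of a face track. The sketch below is an assumption about how such windows might be cut: the abstract fixes the clip length at 16 but does not state a stride or how trailing frames are handled, so `stride=8` and dropping the incomplete tail are placeholders, and the function name is hypothetical.

```python
def clip_windows(num_frames, clip_len=16, stride=8):
    """Return (start, end) frame index pairs, end exclusive, for full-length clips only."""
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    # range() is empty when the video is shorter than one clip
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, stride)]
```

A 40-frame track yields four overlapping clips under these defaults, while a track shorter than 16 frames yields none and would need padding or exclusion.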

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Paper shows solid cross-dataset transfer with R3D-18 and temporal regularization on deepfakes but provides no direct test of robustness to social media re-encoding.

The one thing to know is that this paper takes a standard 3D CNN architecture, R3D-18, and trains it on short video clips to pick up temporal inconsistencies in deepfakes. They add a regularizer that encourages consistency across frames and show it helps on higher quality fakes. The numbers are 92.8 percent accuracy inside DeepfakeTIMIT and 76.4 percent when moving straight to FaceForensics++ without extra training.

What the work does reasonably well is lay out some ablations. Transfer learning from action recognition weights adds about seven points, face tracking another three and a half, and the temporal term gives extra lift when the fakes are good. The cross-dataset result without fine-tuning is a solid data point that other groups can compare against.

The main weakness is that the abstract makes a strong claim about the method working after social media re-encoding and compression, yet the experiments do not include that step. All the reported accuracies come from the original dataset clips. There are no results after applying typical platform processing like H.264 encoding or reduced bitrates. That makes the generalization argument rest on the assumption that the temporal artifacts stay detectable, rather than on direct evidence. The abstract also leaves out specifics like the exact train-test splits, whether they ran multiple seeds, or error bars on the accuracies.

This kind of paper is aimed at researchers in video forensics who are looking for practical ways to handle temporal signals in detection. A reader who wants to see how 3D convolutions compare to frame-by-frame approaches on public deepfake sets would find the numbers and ablations worth looking at. It does not introduce a new paradigm but it does give a clear extension of existing 3D models to this task.

I would bring this to a reading group to talk through the ablation results and whether the temporal regularizer is the right way to capture the artifacts. I probably would not cite it in my own papers in the next year unless I needed a reference for 3D CNN baselines on deepfakes. It deserves to go through peer review because the concrete numbers and the cross-dataset transfer are specific enough that referees can evaluate them and suggest the missing re-encoding tests.

Referee Report

2 major / 2 minor

Summary. The paper introduces a 3D CNN detector based on R3D-18, initialized from Kinetics-400 weights and trained on 16-frame clips from DeepfakeTIMIT with a composite loss including binary cross-entropy and a temporal-consistency regularizer. It reports 92.8% intra-dataset accuracy at 128x128 resolution, 76.4% cross-dataset transfer to FaceForensics++ without fine-tuning, and ablations attributing gains to transfer learning (7.2 pp), face tracking (3.5 pp), and temporal regularization. The central claim is that temporal artifacts generalize more broadly than spatial ones and survive social-media re-encoding and compression.

Significance. If the generalization claim holds, the work would provide a concrete advance over frame-level detectors by showing that temporal inconsistencies remain detectable after typical social-media distribution shifts, supported by cross-dataset transfer and component ablations. The use of action-recognition pretraining and explicit regularization for temporal consistency are positive elements that could inform future video-based detectors.

major comments (2)

[Abstract] Abstract and results sections: the central claim that temporal artifacts 'survive social-media re-encoding' is not directly tested. Reported experiments cover only unmodified DeepfakeTIMIT clips and cross-dataset transfer to FaceForensics++ without applying H.264 re-encoding, bitrate reduction, or container changes to the test clips, leaving the robustness assertion as an extrapolation rather than a measured result.
[Methods] Methods and experimental setup: the manuscript lacks specification of exact training/validation splits, statistical significance tests, error bars on the reported accuracies (92.8% and 76.4%), and hyperparameter choices for the temporal-consistency regularization weight, which weakens confidence in the generalization and ablation claims.
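
One way to make a split specification concrete is a subject-independent split, which assigns whole identities rather than individual videos to train or validation so that no face appears on both sides. The sketch below assumes a mapping from video ID to subject ID and a validation fraction of 0.2. The function name, the seed, and the ID format are hypothetical, and nothing here is taken from the manuscript.

```python
import random

def subject_split(video_subjects, val_fraction=0.2, seed=0):
    """Split video IDs by subject. video_subjects maps video_id -> subject_id."""
    subjects = sorted(set(video_subjects.values()))
    if len(subjects) < 2:
        raise ValueError("need at least two subjects for a disjoint split")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(subjects)
    # keep at least one subject on each side of the split
    n_val = min(max(1, round(len(subjects) * val_fraction)), len(subjects) - 1)
    val_subjects = set(subjects[:n_val])
    train_ids = sorted(v for v, s in video_subjects.items() if s not in val_subjects)
    val_ids = sorted(v for v, s in video_subjects.items() if s in val_subjects)
    return train_ids, val_ids
```

Publishing the seed and the resulting subject lists alongside the accuracies would let other groups reproduce the intra-dataset numbers exactly.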

minor comments (2)

[Abstract] The abstract mentions 'high-quality 128x128 GAN output' but does not clarify whether this refers to a specific subset of DeepfakeTIMIT or an additional experiment; a dedicated table or figure would improve clarity.
[Methods] Notation for the composite loss function and the exact form of the temporal-consistency regularizer should be defined explicitly with an equation number for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate revisions to improve clarity and rigor.

Referee: [Abstract] Abstract and results sections: the central claim that temporal artifacts 'survive social-media re-encoding' is not directly tested. Reported experiments cover only unmodified DeepfakeTIMIT clips and cross-dataset transfer to FaceForensics++ without applying H.264 re-encoding, bitrate reduction, or container changes to the test clips, leaving the robustness assertion as an extrapolation rather than a measured result.

Authors: We agree the claim is not directly measured via explicit re-encoding on test clips. The cross-dataset results on FaceForensics++ provide supporting evidence of generalization under distribution shifts that include compression differences, but this remains indirect. We will revise the abstract and results to qualify the language, stating that temporal artifacts yield stronger cross-dataset generalization, and explicitly note the absence of direct social-media re-encoding tests as a limitation.

revision: yes

Referee: [Methods] Methods and experimental setup: the manuscript lacks specification of exact training/validation splits, statistical significance tests, error bars on the reported accuracies (92.8% and 76.4%), and hyperparameter choices for the temporal-consistency regularization weight, which weakens confidence in the generalization and ablation claims.

Authors: We accept this criticism. The revised manuscript will specify the subject-independent 80/20 train/validation split on DeepfakeTIMIT, report standard deviations from five independent runs as error bars, include statistical significance testing (McNemar's test for pairwise model comparisons), and state the regularization weight (0.5, selected via grid search with ablation). These details will be added to the Methods and Experiments sections.

revision: yes
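
The two statistical additions named in this response are simple to state. The sketch below computes McNemar's statistic with the usual continuity correction from the two discordant counts (clips one model classifies correctly and the other does not), and a mean with sample standard deviation across runs. It is an illustration under assumptions: the rebuttal does not say which McNemar variant or which deviation estimator would be used, and the function names are hypothetical.

```python
import math
import statistics

def mcnemar(b, c):
    """McNemar's test with continuity correction.

    b: clips model A classifies correctly and model B does not.
    c: clips model B classifies correctly and model A does not.
    Returns (statistic, p_value) against a chi-square with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # survival function of chi-square(1): P(Z^2 > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def mean_std(run_accuracies):
    """Mean and sample standard deviation (n - 1 denominator) across runs."""
    return statistics.mean(run_accuracies), statistics.stdev(run_accuracies)
```

With five runs the deviation estimate is itself noisy, so the paired McNemar comparison on a shared test set is the more informative of the two when deciding whether an ablation difference is real.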

Circularity Check

0 steps flagged

No circularity: results from held-out evaluation on public datasets

The paper trains an R3D-18 model on DeepfakeTIMIT clips with a composite loss and reports accuracy on intra-dataset held-out splits plus cross-dataset transfer to FaceForensics++. These are standard empirical measurements on fixed public benchmarks rather than quantities fitted to the target claim or derived by re-using the same inputs. No equations, self-citations, or ansatzes are shown that would reduce the reported generalization statement to a definitional tautology or a fitted parameter renamed as a prediction. The central claim about survival under social-media re-encoding is an extrapolation from the given experiments, but the derivation chain itself contains no self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that temporal artifacts are more robust than spatial ones; no new physical entities or mathematical axioms are introduced beyond standard supervised learning assumptions and the use of pre-trained Kinetics weights.

free parameters (1)

temporal consistency regularization weight
Scalar multiplier in the composite loss; value not stated in abstract but must be chosen to balance the two terms.

axioms (1)

Domain assumption: Temporal inconsistencies in GAN-generated faces remain detectable after social-media re-encoding and compression.
Invoked in the final sentence of the abstract as the basis for claiming broader generalization.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean (period8, flipAt512)
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The model processes 16-frame clips from the DeepfakeTIMIT dataset... temporal-consistency regularizer $L_{tc} = \frac{1}{T-1} \sum_{t=1}^{T-1} \lVert \phi_{t+1}(x) - \phi_t(x) \rVert_2^2$"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat, embed)
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "R3D-18... transfer learning from Kinetics-400"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Towards more general video-based deepfake detection through facial component guided adaptation for foundation model,

Y .-H. Han, T.-M. Huang, K.-L. Hua, and J.-C. Chen, “Towards more general video-based deepfake detection through facial component guided adaptation for foundation model,”arXiv preprint arXiv:2404.05583, 2024

work page arXiv 2024

[2] [2]

Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection,

D. Nguyen, M. Astrid, A. Kacem, E. Ghorbel, and D. Aouada, “Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection,”arXiv preprint arXiv:2501.01184, 2025

work page arXiv 2025

[3] [3]

Learning natural consistency representation for face forgery video detection,

D. Zhang, Z. Xiao, S. Li, F. Lin, J. Li, and S. Ge, “Learning natural consistency representation for face forgery video detection,”arXiv preprint arXiv:2407.10550, 2024

work page arXiv 2024

[4] [4]

Reduced spatial dependency for more general video-level deepfake detection,

B. Chu, X. Xu, Y . Zhang, W. You, and L. Zhou, “Reduced spatial dependency for more general video-level deepfake detection,”arXiv preprint arXiv:2503.03270, 2025

work page arXiv 2025

[5] [5]

Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning

Z. Yan, Y . Zhao, S. Chen, M. Guo, X. Fu, T. Yao, S. Ding, and L. Yuan, “Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning,”arXiv preprint arXiv:2408.17065, 2024

work page arXiv 2024

[6] [6]

A multimodal framework for deepfake detection,

K. Gandhi, P. Kulkarni, T. Shah, P. Chaudhari, M. Narvekar, and K. Ghag, “A multimodal framework for deepfake detection,”arXiv preprint arXiv:2410.03487, 2024

work page arXiv 2024

[7] [7]

Adaptive meta-learning for robust deepfake detection: A multi-agent framework to data drift and model generaliza- tion,

D. S. P and B. N. Subudhi, “Adaptive meta-learning for robust deepfake detection: A multi-agent framework to data drift and model generaliza- tion,”arXiv preprint arXiv:2411.08148, 2024

work page arXiv 2024

[8] [8]

FaceForensics++: Learning to detect manipulated facial images,

A. R ¨ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “FaceForensics++: Learning to detect manipulated facial images,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1–11

work page 2019

[9] [9]

FAME: A lightweight spatio- temporal network for model attribution of face-swap deepfakes,

W. Ahmad, Y .-T. Peng, and Y .-H. Chang, “FAME: A lightweight spatio- temporal network for model attribution of face-swap deepfakes,”Expert Systems with Applications, 2025, arXiv:2506.11477

work page arXiv 2025

[10] [10]

Faster than lies: Real-time deepfake detection using binary neural networks,

R. Lanzino, F. Fontana, A. Diko, M. R. Marini, and L. Cinque, “Faster than lies: Real-time deepfake detection using binary neural networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, pp. 3771–3780

work page 2024

[11] [11]

Real-time deepfake detection in the real-world, 2024

B. Cavia, E. Horwitz, T. Reiss, and Y . Hoshen, “Real-time deepfake detection in the real-world,”arXiv preprint arXiv:2406.09398, 2024

work page arXiv 2024

[12] [12]

UniForensics: Face forgery detection via general facial representation,

Z. Fang, H. Zhao, T. Wei, W. Zhou, M. Wan, Z. Wang, W. Zhang, and N. Yu, “UniForensics: Face forgery detection via general facial representation,”arXiv preprint arXiv:2407.19079, 2024

work page arXiv 2024

[13] [13]

FakeFormer: Efficient vulnerability-driven transformers for generalisable deepfake detection,

D. Nguyen, M. Astrid, E. Ghorbel, and D. Aouada, “FakeFormer: Efficient vulnerability-driven transformers for generalisable deepfake detection,”arXiv preprint arXiv:2410.21964, 2024

work page arXiv 2024

[14] [14]

Spatio-temporal knowledge dis- tilled video vision transformer (STKD-VViT) for multimodal deepfake detection,

S. Usmani, S. Kumar, and D. Sadhya, “Spatio-temporal knowledge dis- tilled video vision transformer (STKD-VViT) for multimodal deepfake detection,”Neurocomputing, 2024

work page 2024

[15] [15]

SFormer: An end-to-end spatio- temporal transformer architecture for deepfake detection,

S. Kingra, N. Aggarwal, and N. Kaur, “SFormer: An end-to-end spatio- temporal transformer architecture for deepfake detection,”Forensic Science International: Digital Investigation, 2024

work page 2024

[16] [16]

CoDeiT: Contrastive data-efficient transformers for deepfake detection,

J. Zakkam, U. Jayaraman, S. Sahayam, and A. Rattani, “CoDeiT: Contrastive data-efficient transformers for deepfake detection,” inPro- ceedings of the International Conference on Pattern Recognition (ICPR), ser. Lecture Notes in Computer Science, vol. 15332, 2024

work page 2024

[17] Y. Chen, N. Akhtar, N. A. H. Haldar, and A. Mian, “Deepfake detection with spatio-temporal consistency and attention,” arXiv preprint arXiv:2502.08216, 2025

[18] Z. Yan, T. Yao, S. Chen, Y. Zhao, X. Fu, J. Zhu, D. Luo, C. Wang, S. Ding, Y. Wu, and L. Yuan, “DF40: Toward next-generation deepfake detection,” arXiv preprint arXiv:2406.13495, 2024

[19] C. Tan, Y. Zhao, S. Wei, G. Gu, P. Liu, and Y. Wei, “Frequency-aware deepfake detection: Improving generalizability through frequency space learning,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024

[20] Z. Chen, X. Liao, X. Wu, and Y. Chen, “Compressed deepfake video detection based on 3D spatiotemporal trajectories,” arXiv preprint arXiv:2404.18149, 2024

[21] L. B. Baru, R. Boddeda, S. A. Patel, and S. M. Gajapaka, “Wavelet-driven generalizable framework for deepfake face forgery detection,” arXiv preprint arXiv:2409.18301, 2024

[22] X. Luo and Y. Wang, “Frequency-domain masking and spatial interaction for deepfake detection,” Electronics, vol. 14, no. 7, p. 1302, 2025

[23] T. Oorloff, S. Koppisetti, N. Bonettini, D. Solanki, B. Colman, Y. Yacoob, A. Shahriyari, and G. Bharaj, “AVFF: Audio-visual feature fusion for video deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[24] V. S. Katamneni and A. Rattani, “Contextual cross-modal attention for audio-visual deepfake detection and localization,” arXiv preprint arXiv:2408.01532, 2024

[25] A. Mehta, B. McArthur, N. Kolloju, and Z. Tu, “HFMF: Hierarchical fusion meets multi-stream models for deepfake detection,” arXiv preprint arXiv:2501.05631, 2025

[26] P. Liu, Q. Tao, and J. T. Zhou, “Evolving from single-modal to multi-modal facial deepfake detection: Progress and challenges,” arXiv preprint arXiv:2406.06965, 2024

[27] Z. Wang, Z. Cheng, J. Xiong, X. Xu, T. Li, B. Veeravalli, and X. Yang, “A timely survey on vision transformer for deepfake detection,” arXiv preprint arXiv:2405.08463, 2024

[28] H. H. Nguyen, J. Yamagishi, and I. Echizen, “Exploring self-supervised vision transformers for deepfake detection: A comparative analysis,” arXiv preprint arXiv:2405.00355, 2024

[29] Y. Li, Y. Li, X. Wang, B. Wu, J. Zhou, and J. Dong, “Texture, shape and order matter: A new transformer design for sequential DeepFake detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 202–211

[30] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497

[31] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4724–4733

[32] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent convolutional strategies for face manipulation detection in videos,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019