Exposing and Mitigating Temporal Attack in Deepfake Video Detection

Hao Jiang; Minghao Shao; Mingkun Xu; Shijie Zhang; Yusong Wang; Zhen Wang; Zheyuan Gu

arXiv: 2605.07398 · v1 · submitted 2026-05-08 · cs.CV · cs.AI

Zheyuan Gu, Minghao Shao, Zhen Wang, Yusong Wang, Mingkun Xu, Shijie Zhang, Hao Jiang

Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3

keywords: deepfake detection, temporal spectral attacks, video forensics, adversarial robustness, shortcut learning, spatiotemporal models, spectral invariance, evasion attacks

The pith

Deepfake video detectors overfit to fragile temporal spectrum cues and can be evaded by spectral attacks, while SpInShield forces reliance on stable semantic motion instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that high-performing spatiotemporal deepfake detectors actually depend on unstable temporal spectral features that attackers can easily manipulate. This overfitting leaves the models open to evasion even when they appear accurate on standard tests. SpInShield counters the problem by introducing a learnable spectral adversary that creates severe deformations during training and a shortcut suppression strategy that removes those manipulatable statistics from the model's latent space. A reader should care because real-world deepfakes will likely involve exactly these kinds of spectral tweaks, so detectors must learn causal motion patterns rather than brittle frequency shortcuts. If the approach holds, it would produce detectors that remain effective when adversaries target the temporal spectrum.

Core claim

Spatiotemporal deepfake detectors achieve high AUC scores yet remain susceptible to evasion because they overfit on fragile temporal spectrum cues instead of learning robust semantic causality. SpInShield addresses this by decoupling semantic motion from manipulatable spectral artifacts: a learnable spectral adversary dynamically synthesizes severe spectral deformations to simulate extreme attacks, and a shortcut suppression optimization compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space.

What carries the argument

The learnable spectral adversary, which dynamically generates severe spectral deformations to mimic extreme attacks, paired with shortcut suppression optimization that removes unstable spectral statistics from the latent representation.
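
To make the idea of an amplitude spectral deformation concrete, here is a minimal NumPy sketch of a fixed (non-learnable) perturbation along the time axis. It is not the paper's learnable adversary, whose architecture the abstract does not specify. The function name `perturb_temporal_amplitude`, the `(T, H, W)` clip layout, and the `gains` vector are all assumptions made for illustration.

```python
import numpy as np

def perturb_temporal_amplitude(clip, gains):
    """Rescale temporal-frequency amplitudes of a clip while keeping phase.

    clip  : float array of shape (T, H, W), frames along axis 0 (assumed layout).
    gains : float array of shape (T // 2 + 1,), one non-negative multiplier
            per rFFT bin (hypothetical attack parameters).
    """
    spectrum = np.fft.rfft(clip, axis=0)        # complex, shape (T//2+1, H, W)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    scaled = amplitude * gains[:, None, None]   # change amplitude only
    attacked = scaled * np.exp(1j * phase)      # recombine with original phase
    return np.fft.irfft(attacked, n=clip.shape[0], axis=0)

# Toy clip of random values, standing in for a face crop sequence.
rng = np.random.default_rng(0)
clip = rng.random((16, 8, 8))
gains = np.ones(16 // 2 + 1)
gains[3:] = 0.2                                 # damp every bin from index 3 upward
attacked = perturb_temporal_amplitude(clip, gains)
```

Because only the amplitude is rescaled and the phase is left untouched, the per-pixel temporal mean is unchanged here (the gain at bin 0 is 1) while the spectral profile a detector might latch onto is altered.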

If this is right

Models trained under SpInShield retain competitive AUC on standard deepfake datasets while showing substantially higher resistance to amplitude spectral attacks.
The encoder is forced to prioritize semantic motion causality over any spectral shortcuts that can be altered by an adversary.
The same training procedure can be applied to other video-based forensic tasks that currently rely on fragile frequency-domain cues.
Detectors become harder to evade because attackers must now alter the underlying semantic content rather than just the spectral profile.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar spectral vulnerabilities are likely present in other video understanding models that process motion, such as action recognition systems.
The defense suggests that any detection method relying on frequency statistics should be re-examined for shortcut learning before deployment.
Real-world validation would require applying the method to deepfakes generated by unknown future manipulation techniques rather than only simulated attacks.
The approach could extend to audio or multimodal deepfakes if analogous spectral instabilities exist in those domains.

Load-bearing premise

That the simulated spectral deformations accurately represent real attacker capabilities and that removing unstable spectral statistics leaves behind all the forensic information the detector actually needs.

What would settle it

A test set of deepfake videos subjected to real amplitude spectral modifications where SpInShield's AUC falls to the level of the strongest undefended baseline.
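
The comparison described here comes down to computing AUC on clean and attacked scores and taking the difference in percentage points. The sketch below uses the pairwise-ranking definition of AUC. The scores and labels are made-up toy values, not results from the paper, and the helper name `auc` is hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random fake outranks a random real (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)     # True = fake, False = real
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]          # every fake/real pair
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

labels = [1, 1, 1, 0, 0, 0]
clean_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]       # toy detector outputs
attacked_scores = [0.9, 0.25, 0.15, 0.3, 0.2, 0.1]  # toy outputs after an attack

# AUC drop in percentage points between the clean and attacked settings.
drop_pp = 100 * (auc(clean_scores, labels) - auc(attacked_scores, labels))
```

Running the same computation for a defended model and for the strongest undefended baseline on an independently attacked test set would give the two numbers this criterion asks to compare.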

Figures

Figures reproduced from arXiv: 2605.07398 by Hao Jiang, Minghao Shao, Mingkun Xu, Shijie Zhang, Yusong Wang, Zhen Wang, Zheyuan Gu.

**Figure 1.** SLF [5] relies on temporal spectral cues for detection, and fails with misclassifications when these cues are suppressed. However, building robust detectors faces three challenges. First, separating malicious temporal spectrum artifacts from legitimate motion is complex. The temporal frequency of forgeries overlaps with genuine facial dynamics, such as microexpressions or blinking. Naively suppressing sp…

**Figure 2.** AUC under temporal notch suppression at representative DFT-bin frequencies. The x-axis …

**Figure 3.** The framework comprising four interconnected modules: Feature Extraction, Adversarial …

**Figure 5.** Quantitative and qualitative evaluation: (a) Joint impact of hyperparameters …
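
The Figure 2 caption refers to temporal notch suppression at DFT-bin frequencies. One plausible reading, sketched below under that assumption, is that a single temporal DFT bin (and its conjugate mirror) is zeroed for every pixel before the clip reaches the detector. The paper's exact procedure is not given in the text available here, and `temporal_notch` is a hypothetical helper.

```python
import numpy as np

def temporal_notch(clip, k):
    """Zero temporal DFT bin k (and its mirror bin) for every pixel.

    clip : float array of shape (T, H, W), frames along axis 0 (assumed layout).
    k    : index of the DFT bin to suppress, 0 <= k < T.
    """
    T = clip.shape[0]
    spectrum = np.fft.fft(clip, axis=0)
    spectrum[k] = 0
    spectrum[(T - k) % T] = 0       # conjugate partner, so the output stays real
    return np.fft.ifft(spectrum, axis=0).real

# Toy clip: a constant plus a temporal cosine that sits exactly on bin 2.
t = np.arange(16)
signal = 1.0 + np.cos(2 * np.pi * 2 * t / 16)
clip = np.tile(signal[:, None, None], (1, 4, 4))
notched = temporal_notch(clip, 2)   # removes the oscillation, keeps the constant
```

Sweeping `k` over several bins and re-scoring the detector each time would produce an AUC-versus-frequency curve of the kind the caption describes.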

Abstract

While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SpInShield identifies overfitting to temporal spectral shortcuts in deepfake detectors and counters it with a learnable adversary plus shortcut suppression, but the big reported gains come only from attacks made by that same adversary.

The paper's main point is that current spatiotemporal deepfake detectors latch onto fragile temporal spectrum cues and can be evaded, so SpInShield trains an encoder against a learnable spectral adversary that creates severe amplitude deformations while also suppressing unstable spectral statistics in the latent space. This is the new piece: the dynamic adversary paired with explicit shortcut purging to push the model toward semantic motion features instead. The abstract shows they get competitive numbers on standard datasets and a 21.30 percentage-point AUC lift under the simulated attacks, which at least demonstrates the idea can be implemented and measured. That is useful work on a practical problem.

The soft spot is the evaluation. All the robustness results use attacks synthesized by the proposed learnable adversary during training. This setup risks measuring how well the defense beats its own training distribution rather than real evasion attempts from independent deepfake generators. Without tests against other attack methods or unaltered real-world temporal manipulations, the gain stays tied to the simulation. The suppression step also lacks detail on the exact instability criterion and whether useful forensic traces get removed along with the unstable ones. The abstract gives no experimental setup or baseline descriptions, so the comparison is hard to judge from what's here.

This paper is for people working on video deepfake detection and adversarial robustness in computer vision. A reader who wants concrete mechanisms for forcing invariant features would find the framework worth looking at, even with the current limits on the evidence. It shows honest engagement with the vulnerability and a clear proposal, so it deserves a serious referee. I would send it to peer review rather than desk reject, with reviewers asked to check whether the attack distribution matches plausible real threats and whether the purged statistics lose signal.

Referee Report

3 major / 2 minor

Summary. The paper claims that spatiotemporal deepfake detectors overfit to fragile temporal spectrum cues rather than robust semantic features, making them vulnerable to evasion attacks. It proposes SpInShield, a temporal spectral-invariant defense that introduces a learnable spectral adversary to dynamically synthesize severe amplitude spectral deformations during training, combined with a shortcut suppression optimization to purge unstable spectral statistics from the latent space. The method is reported to achieve competitive performance on standard deepfake datasets while delivering a 21.30 percentage point AUC gain over the strongest baseline specifically under the simulated attacks generated by this adversary.

Significance. If the simulated attacks faithfully represent the distribution of real evasion attacks that deepfake generators can produce, SpInShield could offer a practical framework for building more robust detectors by enforcing invariance to manipulatable spectral artifacts. The learnable adversary approach for simulating extreme scenarios is a potentially useful training-time augmentation technique, though its value depends on independent validation beyond the training distribution.

major comments (3)

[Abstract / Experimental Evaluation] Abstract and experimental results: the headline 21.30 pp AUC improvement is reported exclusively under 'simulated amplitude spectral attacks' generated by the same learnable spectral adversary used during training. This creates a potential circularity risk; the evaluation does not demonstrate robustness against independent real-world temporal manipulations or fixed non-learnable attacks, so the gain may reflect overfitting to the adversary's output distribution rather than genuine invariance.
[Method / Shortcut Suppression Optimization] Shortcut suppression strategy: the description of 'purging unstable spectral statistics' lacks an explicit, reproducible definition or criterion (e.g., variance threshold, gradient norm, or statistical test). Without this and an ablation confirming that the purged features do not contain stable forensic cues under other perturbations, it remains unclear whether useful detection signal is being discarded.
[Experiments] Experimental setup: the abstract and results provide no details on the specific baselines compared, the datasets and attack parameters used to train/validate the learnable adversary, or how the 'widely used datasets' were split for the robustness experiments. This limits verification of the central claim and reproducibility.

minor comments (2)

[Method] Notation for spectral components (e.g., amplitude vs. phase) could be clarified with explicit equations or diagrams to avoid ambiguity in the temporal spectrum discussion.
[Abstract] The abstract mentions 'competitive performance on widely used datasets' but does not name the datasets or report the corresponding AUC numbers; adding a summary table would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and commit to revisions that enhance the clarity, reproducibility, and strength of our claims without altering the core contributions.

Referee: [Abstract / Experimental Evaluation] Abstract and experimental results: the headline 21.30 pp AUC improvement is reported exclusively under 'simulated amplitude spectral attacks' generated by the same learnable spectral adversary used during training. This creates a potential circularity risk; the evaluation does not demonstrate robustness against independent real-world temporal manipulations or fixed non-learnable attacks, so the gain may reflect overfitting to the adversary's output distribution rather than genuine invariance.

Authors: We acknowledge the validity of the circularity concern. The learnable adversary is deliberately trained to generate severe deformations as a worst-case training augmentation, and the reported gain under its distribution validates the shortcut-suppression objective. However, this does not fully substitute for evaluation on independent attacks. In the revision we will add results on fixed (non-learnable) amplitude spectral perturbations and at least one additional temporal manipulation method drawn from the literature, using the same evaluation protocol. These new experiments will be reported alongside the existing adversary-based results. revision: yes
Referee: [Method / Shortcut Suppression Optimization] Shortcut suppression strategy: the description of 'purging unstable spectral statistics' lacks an explicit, reproducible definition or criterion (e.g., variance threshold, gradient norm, or statistical test). Without this and an ablation confirming that the purged features do not contain stable forensic cues under other perturbations, it remains unclear whether useful detection signal is being discarded.

Authors: We agree that the current description of the shortcut suppression optimization is insufficiently precise. The revised manuscript will include the exact loss formulation, the criterion used to identify unstable spectral statistics (a variance-based threshold computed over the batch in the frequency domain), and the optimization schedule. We will also add an ablation that measures detection performance when the suppression term is removed or replaced by random feature dropout, under both the original and additional perturbation sets, to confirm that stable forensic cues are retained. revision: yes
Referee: [Experiments] Experimental setup: the abstract and results provide no details on the specific baselines compared, the datasets and attack parameters used to train/validate the learnable adversary, or how the 'widely used datasets' were split for the robustness experiments. This limits verification of the central claim and reproducibility.

Authors: We accept this criticism. The revised experimental section will explicitly list all baselines with their original references and hyper-parameters, name the datasets (FaceForensics++, Celeb-DF, DFDC) together with the exact train/validation/test splits and preprocessing, and provide the full training protocol and hyper-parameters for the learnable spectral adversary (including deformation severity ranges and optimization settings). All robustness experiments will be described with the same level of detail. revision: yes
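
The authors' response mentions a variance-based threshold computed over the batch in the frequency domain but gives no formula. The following is one possible reading, offered purely as an assumption: take per-frame latent features, compute their temporal amplitude spectra, and flag the frequency bins whose amplitude varies most across the batch. The function name, the `(B, T, D)` layout, and the quantile cutoff are hypothetical.

```python
import numpy as np

def unstable_bins(features, quantile=0.75):
    """Flag temporal-frequency bins whose amplitude varies most across a batch.

    features : array of shape (B, T, D), per-frame latent features (assumed layout).
    Returns a boolean mask of shape (T // 2 + 1,); True marks an "unstable" bin.
    """
    amplitude = np.abs(np.fft.rfft(features, axis=1))   # (B, T//2+1, D)
    variance = amplitude.var(axis=0).mean(axis=-1)      # batch variance per bin
    threshold = np.quantile(variance, quantile)         # hypothetical cutoff
    return variance > threshold

# Toy batch: small noise plus a bin-4 oscillation whose strength differs per sample.
rng = np.random.default_rng(0)
B, T, D = 8, 16, 4
t = np.arange(T)
strength = rng.uniform(0.5, 2.0, size=B)
features = 0.01 * rng.standard_normal((B, T, D))
features += strength[:, None, None] * np.cos(2 * np.pi * 4 * t / T)[None, :, None]
mask = unstable_bins(features)                          # bin 4 should be flagged
```

A suppression loss could then penalize latent energy in the flagged bins. Whether that also discards stable forensic cues is exactly what the promised ablation would need to show.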

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

The paper introduces SpInShield with a learnable spectral adversary for training and reports performance gains under the resulting simulated attacks. This follows standard adversarial training and evaluation protocols without reducing claims to definitional equivalence or fitted inputs by construction. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would force the central results (competitive AUC on standard datasets and +21.30 pp under simulated attacks) to collapse into the method's own inputs. The experimental comparisons remain independent of any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the high-level proposal of the learnable spectral adversary.

invented entities (1)

learnable spectral adversary (no independent evidence)
purpose: dynamically synthesizes severe spectral deformations to simulate extreme attack scenarios
New component introduced to train against spectral attacks

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021. URL https://api.semanticscholar.org/CorpusID:232417054

[2] Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18689–18698, 2022. doi: 10.1109/CVPR52688.2022.01815

[3] Jikang Cheng, Zhiyuan Yan, Ying Zhang, Yuhao Luo, Zhongyuan Wang, and Chen Li. Can we leave deepfake data behind in training deepfake detector? In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9798331314385

[4] Robert M. Chesney and Danielle Keats Citron. Deep fakes: A looming challenge for privacy, democracy, and national security. California Law Review, 107:1753, 2018. URL https://api.semanticscholar.org/CorpusID:158865631

[5] Jongwook Choi, Taehoon Kim, Yonghyun Jeong, Seungryul Baek, and Jongwon Choi. Exploiting style latent flows for generalizing deepfake video detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1133–1143, 2024

[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2020. URL https://api.semanticscholar.org/CorpusI...

[7] Nick Dufour and Andrew Gully. Contributing data to deepfake detection research. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html, 9 2019. Google AI Blog. Accessed: 2023-07-30

[8] Tarik Dzanic, Karan Shah, and Freddie Witherden. Fourier spectrum discrepancies in deep network generated images, 2020. URL https://arxiv.org/abs/1911.06465

[9] David Field and Damon Chandler. Method for estimating the relative contribution of phase and power spectra to the total information in natural-scene patches. Journal of the Optical Society of America A, 29:55–67, 12 2011. doi: 10.1364/JOSAA.29.000055
[10] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020

[11] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2:665–673, 2020. URL https://api.semanticscholar.org/CorpusID:215786368

[12] David Güera and Edward J. Delp. Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6, 2018. doi: 10.1109/AVSS.2018.8639163

[13] Yue Hua Han, Tai Ming Huang, Kai Lung Hua, and Jun Cheng Chen. Towards more general video-based deepfake detection through facial component guided adaptation for foundation model. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[14] Omar Hommos, Silvia L. Pintea, Pascal S. M. Mettes, and Jan C. van Gemert. Using phase instead of optical flow for action recognition, 2018. URL https://arxiv.org/abs/1809.03258

[15] Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. 2022

[16] Baojin Huang, Zhongyuan Wang, Jifan Yang, Jiaxin Ai, Qin Zou, Qian Wang, and Dengpan Ye. Implicit identity driven deepfake face swapping detection. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4490–4499, 2023. doi: 10.1109/CVPR52729.2023.00436

[17] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection, 2020. URL https://arxiv.org/abs/2001.03024

[18] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2018. URL https://api.semanticscholar.org/CorpusID:54482423

[19] Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, and Jongwon Choi. Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11198–11207, October 2025

[20] Davis E. King. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res., 10:1755–1758, December 2009. ISSN 1532-4435

[21] Hanzhe Li, Jiaran Zhou, Yuezun Li, Baoyuan Wu, Bin Li, and Junyu Dong. Freqblender: enhancing deepfake detection by blending frequency knowledge. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9798331314385

[22] Yuezun Li, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, United States, 2020
[23] Yuzhen Lin, Wentang Song, Bin Li, Yuezun Li, Jiangqun Ni, Han Chen, and Qiushi Li. Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection,

[24] URL https://arxiv.org/abs/2409.14444

[25] Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 772–781, 2021. URL https://api.semanticscholar.org/CorpusID:232092167

[26] Momina Masood, M. M. Tanzim Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza, and Hafiz Malik. Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53:3974–4026, 2021. URL https://api.semanticscholar.org/CorpusID:232075890

[27] Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes. ACM Computing Surveys (CSUR), 54:1–41, 2020. URL https://api.semanticscholar.org/CorpusID:216080410

[28] Daniel Mas Montserrat, Hanxiang Hao, Sri Kalyan Yarlagadda, Sriram Baireddy, Ruiting Shao, János Horváth, Emily R. Bartusiak, Justin Yang, David Guera, Fengqing Maggie Zhu, and Edward J. Delp. Deepfakes detection with automatic face weighting. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2851–2859, 2020. URL...

[29] Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2025

[30] Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, and Cuong M. Nguyen. Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223:103525, 2022. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2022.103525. URL h...

[31] A.V. Oppenheim and J.S. Lim. The importance of phase in signals. Proceedings of the IEEE, 69(5):529–541, 1981. doi: 10.1109/PROC.1981.12022

[32] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII, page 86–103, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58609-6. doi: 10.1007/978-3-03...

[33] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1–11, 2019. URL https://api.semanticscholar.org/CorpusID:59292011

[34] Michael Rubinstein. Analysis and visualization of temporal variations in video. 2014. URL https://api.semanticscholar.org/CorpusID:41891254

[35] Sefik Serengil and Alper Ozpinar. A benchmark of facial recognition pipelines and co-usability performances of modules. Journal of Information Technologies, 17(2):95–107, 2024. doi: 10.17671/gazibtd.1399077. URL https://dergipark.org.tr/en/pub/gazibtd/issue/84331/1399077
[36]

A ConvNet for the 2020s

Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18699–18708, 2022. doi: 10.1109/CVPR52688.2022.01816

work page doi:10.1109/cvpr52688.2022.01816 2022
[37]

Deepfakes and beyond: A survey of face manipulation and fake detec- tion.ArXiv, abs/2001.00179, 2020

Rubén Tolosana, Rubén Vera-Rodríguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. Deepfakes and beyond: A survey of face manipulation and fake detec- tion.ArXiv, abs/2001.00179, 2020. URL https://api.semanticscholar.org/CorpusID: 209531954

work page arXiv 2001
[38]

Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri

Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks.2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2014. URL https://api.semanticscholar. org/CorpusID:1122604

work page 2015
[39]

Media forensics and deepfakes: An overview.IEEE Journal of Selected Topics in Signal Processing, 14:910–932, 2020

Luisa Verdoliva. Media forensics and deepfakes: An overview.IEEE Journal of Selected Topics in Signal Processing, 14:910–932, 2020. URL https://api.semanticscholar. org/CorpusID:210838881

work page 2020
[40]

Neal Wadhwa, Michael Rubinstein, Frédo Durand, and William T. Freeman. Phase-based video motion processing.ACM Trans. Graph., 32(4), July 2013. ISSN 0730-0301. doi: 10.1145/2461912.2461966. URLhttps://doi.org/10.1145/2461912.2461966

work page doi:10.1145/2461912.2461966 2013
[41]

Videomae v2: Scaling video masked autoencoders with dual masking.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023. URLhttps://api.semanticscholar.org/CorpusID:257805127

work page 2023
[42]

Exposing digital forgeries in video by detecting double mpeg compression

Weihong Wang and Hany Farid. Exposing digital forgeries in video by detecting double mpeg compression. InProceedings of the 8th Workshop on Multimedia and Security, MM&Sec ’06, page 37–47, New York, NY , USA, 2006. Association for Computing Machinery. ISBN 1595934936. doi: 10.1145/1161366.1161375. URL https://doi.org/10.1145/1161366. 1161375. 12

work page doi:10.1145/1161366.1161375 2006
[43]

Yan Wang, Qindong Sun, Dongzhu Rong, and Rong Geng. Multi-domain awareness for compressed deepfake videos detection over social networks guided by common mechanisms between artifacts.Computer Vision and Image Understanding, 247:104072, 2024. ISSN 1077-

work page 2024
[44]

URL https://www.sciencedirect

doi: https://doi.org/10.1016/j.cviu.2024.104072. URL https://www.sciencedirect. com/science/article/pii/S107731422400153X

work page doi:10.1016/j.cviu.2024.104072 2024
[45]

Interactive editing of deformable simulations , year =

Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Frédo Durand, and William Freeman. Eulerian video magnification for revealing subtle changes in the world.ACM Trans. Graph., 31(4), July 2012. ISSN 0730-0301. doi: 10.1145/2185520.2185561. URL https://doi.org/10.1145/2185520.2185561

[46]

Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, and Ran He. Tall: Thumbnail layout for deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22658–22668, 2023

[47]

Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, and Baoyuan Wu. Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8984–8994, 2023. URL https://api.semanticscholar.org/CorpusID:265294623

[48]

Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Ucf: Uncovering common features for generalizable deepfake detection. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22355–22366, 2023. doi: 10.1109/ICCV51070.2023.02048

[49]

Zhiyuan Yan, Jiangming Wang, Zhendong Wang, Peng Jin, Ke-Yue Zhang, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. In International Conference on Machine Learning, URL https://api.semanticscholar.org/CorpusID:274234236

[51]

Zhiyuan Yan, Yandan Zhao, Shen Chen, Xinghe Fu, Taiping Yao, Shouhong Ding, and Li Yuan. Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12615–12625, 2024. URL https://api.semanticscholar.org/CorpusID:272310564

[52]

Chenyu Yi, Siyuan Yang, Haoliang Li, Yap peng Tan, and Alex Kot. Benchmarking the robustness of spatial-temporal models against corruptions, 2022. URL https://arxiv.org/abs/2110.06513

[53]

Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision, 2020. URL https://arxiv.org/abs/1906.08988

[54]

Zheng Yinglin, Bao Jianmin, Chen Dong, Zeng Ming, and Wen Fang. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15044–15054, 2021

[55]

Daichi Zhang, Zihao Xiao, Shikun Li, Fanzhao Lin, Jianmin Li, and Shiming Ge. Learning natural consistency representation for face forgery video detection. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXXIII, page 407–424, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031...

[56]

Wenliang Zhao, Yongming Rao, Weikang Shi, Zuyan Liu, Jie Zhou, and Jiwen Lu. Diffswap: High-fidelity and controllable face swapping via 3d-aware masked diffusion. CVPR, 2023

[57]

Wang Zhendong, Bao Jianmin, Zhou Wengang, Wang Weilun, and Li Houqiang. Altfreezing for more general video face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4129–4138, June 2023

[58]

Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. Wilddeepfake: A challenging real-world dataset for deepfake detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2382–2390, 2020.

