arxiv: 2605.07398 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Exposing and Mitigating Temporal Attack in Deepfake Video Detection

Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords deepfake detection · temporal spectral attacks · video forensics · adversarial robustness · shortcut learning · spatiotemporal models · spectral invariance · evasion attacks

The pith

Deepfake video detectors overfit to fragile temporal spectrum cues and can be evaded by spectral attacks, while SpInShield forces reliance on stable semantic motion instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that high-performing spatiotemporal deepfake detectors actually depend on unstable temporal spectral features that attackers can easily manipulate. This overfitting leaves the models open to evasion even when they appear accurate on standard tests. SpInShield counters the problem by introducing a learnable spectral adversary that creates severe deformations during training and a shortcut suppression strategy that removes those manipulatable statistics from the model's latent space. A reader should care because real-world deepfakes will likely involve exactly these kinds of spectral tweaks, so detectors must learn causal motion patterns rather than brittle frequency shortcuts. If the approach holds, it would produce detectors that remain effective when adversaries target the temporal spectrum.

Core claim

Spatiotemporal deepfake detectors achieve high AUC scores yet remain susceptible to evasion because they overfit on fragile temporal spectrum cues instead of learning robust semantic causality. SpInShield addresses this by decoupling semantic motion from manipulatable spectral artifacts: a learnable spectral adversary dynamically synthesizes severe spectral deformations to simulate extreme attacks, and a shortcut suppression optimization compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space.

What carries the argument

The learnable spectral adversary, which dynamically generates severe spectral deformations to mimic extreme attacks, paired with shortcut suppression optimization that removes unstable spectral statistics from the latent representation.
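
As an illustration only, the kind of amplitude spectral deformation described here can be sketched in NumPy. The summary does not specify how the adversary is parameterized, so the function name `deform_temporal_amplitude`, the per-bin `gains` vector, and the (T, H, W) clip layout are all assumptions. In SpInShield the deformation is learned; this sketch takes the gains as a fixed input.

```python
import numpy as np

def deform_temporal_amplitude(video, gains):
    """Rescale the temporal amplitude spectrum of a clip while keeping its phase.

    video: float array shaped (T, H, W) or (T, H, W, C); axis 0 is time (assumed layout).
    gains: length T//2 + 1 array, one non-negative multiplier per rFFT bin (hypothetical).
    """
    video = np.asarray(video, dtype=np.float64)
    num_frames = video.shape[0]
    spectrum = np.fft.rfft(video, axis=0)          # temporal DFT at every pixel
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Broadcast one gain per frequency bin over all spatial positions.
    gains = np.asarray(gains, dtype=np.float64)
    gains = gains.reshape((-1,) + (1,) * (video.ndim - 1))
    deformed = (amplitude * gains) * np.exp(1j * phase)
    return np.fft.irfft(deformed, n=num_frames, axis=0)
```

With every gain set to 1 the clip is returned unchanged, and keeping only the first gain (the DC bin) collapses each pixel to its temporal mean. A learned adversary would presumably search over such gains rather than fix them by hand.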

If this is right

  • Models trained under SpInShield retain competitive AUC on standard deepfake datasets while showing substantially higher resistance to amplitude spectral attacks.
  • The encoder is forced to prioritize semantic motion causality over any spectral shortcuts that can be altered by an adversary.
  • The same training procedure can be applied to other video-based forensic tasks that currently rely on fragile frequency-domain cues.
  • Detectors become harder to evade because attackers must now alter the underlying semantic content rather than just the spectral profile.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar spectral vulnerabilities are likely present in other video understanding models that process motion, such as action recognition systems.
  • The defense suggests that any detection method relying on frequency statistics should be re-examined for shortcut learning before deployment.
  • Real-world validation would require applying the method to deepfakes generated by unknown future manipulation techniques rather than only simulated attacks.
  • The approach could extend to audio or multimodal deepfakes if analogous spectral instabilities exist in those domains.

Load-bearing premise

That the simulated spectral deformations accurately represent real attacker capabilities and that removing unstable spectral statistics leaves behind all the forensic information the detector actually needs.

What would settle it

A test set of deepfake videos subjected to real amplitude spectral modifications where SpInShield's AUC falls to the level of the strongest undefended baseline.
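
A check of this kind reduces to comparing AUC values computed on the same attacked test set. The sketch below is not the paper's evaluation code: it computes a rank-based AUC and the gap between two detectors in percentage points. The names `auc_score` and `auc_gap_pp` are hypothetical, and labels are assumed to be 1 for fake and 0 for real.

```python
import numpy as np

def auc_score(labels, scores):
    """Rank-based AUC: fraction of (fake, real) pairs where the fake scores higher.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    fake_scores = scores[labels]
    real_scores = scores[~labels]
    greater = (fake_scores[:, None] > real_scores[None, :]).sum()
    ties = (fake_scores[:, None] == real_scores[None, :]).sum()
    return (greater + 0.5 * ties) / (fake_scores.size * real_scores.size)

def auc_gap_pp(labels, scores_a, scores_b):
    """AUC of detector A minus AUC of detector B, in percentage points."""
    return 100.0 * (auc_score(labels, scores_a) - auc_score(labels, scores_b))
```

Under this reading, the claim would be in trouble if `auc_gap_pp` between SpInShield and the strongest undefended baseline were close to zero on independently attacked videos.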

Figures

Figures reproduced from arXiv: 2605.07398 by Hao Jiang, Minghao Shao, Mingkun Xu, Shijie Zhang, Yusong Wang, Zhen Wang, Zheyuan Gu.

Figure 1
Figure 1: SLF [5] relies on temporal spectral cues for detection, and fails with misclassifications when these cues are suppressed. However, building robust detectors faces three challenges. First, separating malicious temporal spectrum artifacts from legitimate motion is complex. The temporal frequency of forgeries overlaps with genuine facial dynamics, such as micro-expressions or blinking. Naively suppressing sp…
Figure 2
Figure 2: AUC under temporal notch suppression at representative DFT-bin frequencies. The x-axis …
Figure 3
Figure 3: The framework comprising four interconnected modules: Feature Extraction, Adversarial …
Figure 5
Figure 5: Quantitative and qualitative evaluation: (a) Joint impact of hyperparameters …
Original abstract

While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that spatiotemporal deepfake detectors overfit to fragile temporal spectrum cues rather than robust semantic features, making them vulnerable to evasion attacks. It proposes SpInShield, a temporal spectral-invariant defense that introduces a learnable spectral adversary to dynamically synthesize severe amplitude spectral deformations during training, combined with a shortcut suppression optimization to purge unstable spectral statistics from the latent space. The method is reported to achieve competitive performance on standard deepfake datasets while delivering a 21.30 percentage point AUC gain over the strongest baseline specifically under the simulated attacks generated by this adversary.

Significance. If the simulated attacks faithfully represent the distribution of real evasion attacks that deepfake generators can produce, SpInShield could offer a practical framework for building more robust detectors by enforcing invariance to manipulatable spectral artifacts. The learnable adversary approach for simulating extreme scenarios is a potentially useful training-time augmentation technique, though its value depends on independent validation beyond the training distribution.

major comments (3)
  1. [Abstract / Experimental Evaluation] Abstract and experimental results: the headline 21.30 pp AUC improvement is reported exclusively under 'simulated amplitude spectral attacks' generated by the same learnable spectral adversary used during training. This creates a potential circularity risk; the evaluation does not demonstrate robustness against independent real-world temporal manipulations or fixed non-learnable attacks, so the gain may reflect overfitting to the adversary's output distribution rather than genuine invariance.
  2. [Method / Shortcut Suppression Optimization] Shortcut suppression strategy: the description of 'purging unstable spectral statistics' lacks an explicit, reproducible definition or criterion (e.g., variance threshold, gradient norm, or statistical test). Without this and an ablation confirming that the purged features do not contain stable forensic cues under other perturbations, it remains unclear whether useful detection signal is being discarded.
  3. [Experiments] Experimental setup: the abstract and results provide no details on the specific baselines compared, the datasets and attack parameters used to train/validate the learnable adversary, or how the 'widely used datasets' were split for the robustness experiments. This limits verification of the central claim and reproducibility.
minor comments (2)
  1. [Method] Notation for spectral components (e.g., amplitude vs. phase) could be clarified with explicit equations or diagrams to avoid ambiguity in the temporal spectrum discussion.
  2. [Abstract] The abstract mentions 'competitive performance on widely used datasets' but does not name the datasets or report the corresponding AUC numbers; adding a summary table would improve clarity.
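
A fixed, non-learnable attack of the sort requested in major comment 1 could be as simple as the temporal notch suppression named in the Figure 2 caption. The following NumPy sketch is an assumption about what such a notch might look like, not the paper's protocol: `temporal_notch`, its `bin_index` and `width` parameters, and the choice to leave the DC bin untouched are all hypothetical.

```python
import numpy as np

def temporal_notch(video, bin_index, width=0):
    """Zero the temporal DFT bins in [bin_index - width, bin_index + width].

    video: float array with time on axis 0 (assumed layout).
    The DC bin (index 0) is never zeroed, so mean brightness is preserved.
    """
    video = np.asarray(video, dtype=np.float64)
    num_frames = video.shape[0]
    spectrum = np.fft.rfft(video, axis=0)
    low = max(bin_index - width, 1)
    high = min(bin_index + width, spectrum.shape[0] - 1)
    spectrum[low:high + 1] = 0.0                  # empty slice if low > high
    return np.fft.irfft(spectrum, n=num_frames, axis=0)
```

Because the notch has no trainable parameters, evaluating a detector against it would not share a distribution with the learnable adversary used in training, which is the separation the referee asks for.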

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and commit to revisions that enhance the clarity, reproducibility, and strength of our claims without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract / Experimental Evaluation] Abstract and experimental results: the headline 21.30 pp AUC improvement is reported exclusively under 'simulated amplitude spectral attacks' generated by the same learnable spectral adversary used during training. This creates a potential circularity risk; the evaluation does not demonstrate robustness against independent real-world temporal manipulations or fixed non-learnable attacks, so the gain may reflect overfitting to the adversary's output distribution rather than genuine invariance.

    Authors: We acknowledge the validity of the circularity concern. The learnable adversary is deliberately trained to generate severe deformations as a worst-case training augmentation, and the reported gain under its distribution validates the shortcut-suppression objective. However, this does not fully substitute for evaluation on independent attacks. In the revision we will add results on fixed (non-learnable) amplitude spectral perturbations and at least one additional temporal manipulation method drawn from the literature, using the same evaluation protocol. These new experiments will be reported alongside the existing adversary-based results. revision: yes

  2. Referee: [Method / Shortcut Suppression Optimization] Shortcut suppression strategy: the description of 'purging unstable spectral statistics' lacks an explicit, reproducible definition or criterion (e.g., variance threshold, gradient norm, or statistical test). Without this and an ablation confirming that the purged features do not contain stable forensic cues under other perturbations, it remains unclear whether useful detection signal is being discarded.

    Authors: We agree that the current description of the shortcut suppression optimization is insufficiently precise. The revised manuscript will include the exact loss formulation, the criterion used to identify unstable spectral statistics (a variance-based threshold computed over the batch in the frequency domain), and the optimization schedule. We will also add an ablation that measures detection performance when the suppression term is removed or replaced by random feature dropout, under both the original and additional perturbation sets, to confirm that stable forensic cues are retained. revision: yes

  3. Referee: [Experiments] Experimental setup: the abstract and results provide no details on the specific baselines compared, the datasets and attack parameters used to train/validate the learnable adversary, or how the 'widely used datasets' were split for the robustness experiments. This limits verification of the central claim and reproducibility.

    Authors: We accept this criticism. The revised experimental section will explicitly list all baselines with their original references and hyper-parameters, name the datasets (FaceForensics++, Celeb-DF, DFDC) together with the exact train/validation/test splits and preprocessing, and provide the full training protocol and hyper-parameters for the learnable spectral adversary (including deformation severity ranges and optimization settings). All robustness experiments will be described with the same level of detail. revision: yes
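
The second response names a variance-based threshold computed over the batch in the frequency domain but gives no formula. One possible reading, offered as an assumption rather than the authors' criterion, is to flag temporal frequency bins whose latent amplitude varies strongly across the batch. The function `unstable_bins`, the (B, T, D) feature layout, and the scalar `threshold` are hypothetical.

```python
import numpy as np

def unstable_bins(features, threshold):
    """Flag temporal frequency bins with high across-batch amplitude variance.

    features: array shaped (B, T, D), a batch of latent sequences (assumed layout).
    Returns (mask, variance), each of length T//2 + 1.
    """
    features = np.asarray(features, dtype=np.float64)
    amplitude = np.abs(np.fft.rfft(features, axis=1))    # (B, T//2 + 1, D)
    # Variance over the batch, then averaged over feature channels.
    variance = amplitude.var(axis=0).mean(axis=-1)
    return variance > threshold, variance
```

A suppression loss could then penalize energy in the flagged bins. Whether that discards stable forensic cues is exactly what the promised ablation would need to show.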

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

full rationale

The paper introduces SpInShield with a learnable spectral adversary for training and reports performance gains under the resulting simulated attacks. This follows standard adversarial training and evaluation protocols without reducing claims to definitional equivalence or fitted inputs by construction. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would force the central results (competitive AUC on standard datasets and +21.30 pp under simulated attacks) to collapse into the method's own inputs. The experimental comparisons remain independent of any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the high-level proposal of the learnable spectral adversary.

invented entities (1)
  • learnable spectral adversary (no independent evidence)
    purpose: dynamically synthesizes severe spectral deformations to simulate extreme attack scenarios
    New component introduced to train against spectral attacks



Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1]

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021. URL https://api.semanticscholar.org/CorpusID:232417054

  2. [2]

    Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18689–18698, 2022. doi: 10.1109/CVPR52688.2022.01815

  3. [3]

    Jikang Cheng, Zhiyuan Yan, Ying Zhang, Yuhao Luo, Zhongyuan Wang, and Chen Li. Can we leave deepfake data behind in training deepfake detector? In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9798331314385

  4. [4]

    Robert M. Chesney and Danielle Keats Citron. Deep fakes: A looming challenge for privacy, democracy, and national security. California Law Review, 107:1753, 2018. URL https://api.semanticscholar.org/CorpusID:158865631

  5. [5]

    Jongwook Choi, Taehoon Kim, Yonghyun Jeong, Seungryul Baek, and Jongwon Choi. Exploiting style latent flows for generalizing deepfake video detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1133–1143, 2024

  6. [6]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2020. URL https://api.semanticscholar.org/CorpusI...

  7. [7]

    Nick Dufour and Andrew Gully. Contributing data to deepfake detection research. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html, 9 2019. Google AI Blog. Accessed: 2023-07-30

  8. [8]

    Tarik Dzanic, Karan Shah, and Freddie Witherden. Fourier spectrum discrepancies in deep network generated images, 2020. URL https://arxiv.org/abs/1911.06465

  9. [9]

    David Field and Damon Chandler. Method for estimating the relative contribution of phase and power spectra to the total information in natural-scene patches. Journal of the Optical Society of America A, 29:55–67, 12 2011. doi: 10.1364/JOSAA.29.000055

  10. [10]

    Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020

  11. [11]

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2:665–673, 2020. URL https://api.semanticscholar.org/CorpusID:215786368

  12. [12]

    David Güera and Edward J. Delp. Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6, 2018. doi: 10.1109/AVSS.2018.8639163

  13. [13]

    Yue Hua Han, Tai Ming Huang, Kai Lung Hua, and Jun Cheng Chen. Towards more general video-based deepfake detection through facial component guided adaptation for foundation model. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  14. [14]

    Omar Hommos, Silvia L. Pintea, Pascal S. M. Mettes, and Jan C. van Gemert. Using phase instead of optical flow for action recognition, 2018. URL https://arxiv.org/abs/1809.03258

  15. [15]

    Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. 2022

  16. [16]

    Baojin Huang, Zhongyuan Wang, Jifan Yang, Jiaxin Ai, Qin Zou, Qian Wang, and Dengpan Ye. Implicit identity driven deepfake face swapping detection. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4490–4499, 2023. doi: 10.1109/CVPR52729.2023.00436

  17. [17]

    Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection, 2020. URL https://arxiv.org/abs/2001.03024

  18. [18]

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2018. URL https://api.semanticscholar.org/CorpusID:54482423

  19. [19]

    Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, and Jongwon Choi. Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11198–11207, October 2025

  20. [20]

    Davis E. King. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res., 10:1755–1758, December 2009. ISSN 1532-4435

  21. [21]

    Hanzhe Li, Jiaran Zhou, Yuezun Li, Baoyuan Wu, Bin Li, and Junyu Dong. Freqblender: enhancing deepfake detection by blending frequency knowledge. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9798331314385

  22. [22]

    Yuezun Li, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, United States, 2020

  23. [23]

    Yuzhen Lin, Wentang Song, Bin Li, Yuezun Li, Jiangqun Ni, Han Chen, and Qiushi Li. Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection,

  24. [24]

    URL https://arxiv.org/abs/2409.14444

  25. [25]

    Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 772–781, 2021. URL https://api.semanticscholar.org/CorpusID:232092167

  26. [26]

    Momina Masood, M. M. Tanzim Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza, and Hafiz Malik. Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53:3974–4026, 2021. URL https://api.semanticscholar.org/CorpusID:232075890

  27. [27]

    Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes. ACM Computing Surveys (CSUR), 54:1–41, 2020. URL https://api.semanticscholar.org/CorpusID:216080410

  28. [28]

    Daniel Mas Montserrat, Hanxiang Hao, Sri Kalyan Yarlagadda, Sriram Baireddy, Ruiting Shao, János Horváth, Emily R. Bartusiak, Justin Yang, David Guera, Fengqing Maggie Zhu, and Edward J. Delp. Deepfakes detection with automatic face weighting. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2851–2859, 2020. URL...

  29. [29]

    Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2025.

  30. [30]

    Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, and Cuong M. Nguyen. Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223:103525, 2022. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2022.103525. URL h...

  31. [31]

    A.V. Oppenheim and J.S. Lim. The importance of phase in signals. Proceedings of the IEEE, 69(5):529–541, 1981. doi: 10.1109/PROC.1981.12022

  32. [32]

    Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII, page 86–103, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58609-6. doi: 10.1007/978-3-03...

  33. [33]

    Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1–11, 2019. URL https://api.semanticscholar.org/CorpusID:59292011

  34. [34]

    Michael Rubinstein. Analysis and visualization of temporal variations in video. 2014. URL https://api.semanticscholar.org/CorpusID:41891254

  35. [35]

    Sefik Serengil and Alper Ozpinar. A benchmark of facial recognition pipelines and co-usability performances of modules. Journal of Information Technologies, 17(2):95–107, 2024. doi: 10.17671/gazibtd.1399077. URL https://dergipark.org.tr/en/pub/gazibtd/issue/84331/1399077

  36. [36]

    Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18699–18708, 2022. doi: 10.1109/CVPR52688.2022.01816

  37. [37]

    Rubén Tolosana, Rubén Vera-Rodríguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. Deepfakes and beyond: A survey of face manipulation and fake detection. ArXiv, abs/2001.00179, 2020. URL https://api.semanticscholar.org/CorpusID:209531954

  38. [38]

    Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2014. URL https://api.semanticscholar.org/CorpusID:1122604

  39. [39]

    Luisa Verdoliva. Media forensics and deepfakes: An overview. IEEE Journal of Selected Topics in Signal Processing, 14:910–932, 2020. URL https://api.semanticscholar.org/CorpusID:210838881

  40. [40]

    Neal Wadhwa, Michael Rubinstein, Frédo Durand, and William T. Freeman. Phase-based video motion processing. ACM Trans. Graph., 32(4), July 2013. ISSN 0730-0301. doi: 10.1145/2461912.2461966. URL https://doi.org/10.1145/2461912.2461966

  41. [41]

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023. URL https://api.semanticscholar.org/CorpusID:257805127

  42. [42]

    Weihong Wang and Hany Farid. Exposing digital forgeries in video by detecting double mpeg compression. In Proceedings of the 8th Workshop on Multimedia and Security, MM&Sec ’06, page 37–47, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595934936. doi: 10.1145/1161366.1161375. URL https://doi.org/10.1145/1161366.1161375

  43. [43]

    Yan Wang, Qindong Sun, Dongzhu Rong, and Rong Geng. Multi-domain awareness for compressed deepfake videos detection over social networks guided by common mechanisms between artifacts. Computer Vision and Image Understanding, 247:104072, 2024. ISSN 1077-

  44. [44]

    doi: https://doi.org/10.1016/j.cviu.2024.104072. URL https://www.sciencedirect.com/science/article/pii/S107731422400153X

  45. [45]

    Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Frédo Durand, and William Freeman. Eulerian video magnification for revealing subtle changes in the world. ACM Trans. Graph., 31(4), July 2012. ISSN 0730-0301. doi: 10.1145/2185520.2185561. URL https://doi.org/10.1145/2185520.2185561

  46. [46]

    Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, and Ran He. Tall: Thumbnail layout for deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22658–22668, 2023

  47. [47]

    Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, and Baoyuan Wu. Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8984–8994, 2023. URL https://api.semanticscholar.org/CorpusID:265294623

  48. [48]

    Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Ucf: Uncovering common features for generalizable deepfake detection. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22355–22366, 2023. doi: 10.1109/ICCV51070.2023.02048

  49. [49]

    Zhiyuan Yan, Jiangming Wang, Zhendong Wang, Peng Jin, Ke-Yue Zhang, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. In International Conference on Machine Learning,

  50. [50]

    URL https://api.semanticscholar.org/CorpusID:274234236

  51. [51]

    Zhiyuan Yan, Yandan Zhao, Shen Chen, Xinghe Fu, Taiping Yao, Shouhong Ding, and Li Yuan. Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12615–12625, 2024. URL https://api.semanticscholar.org/CorpusID:272310564

  52. [52]

    Chenyu Yi, Siyuan Yang, Haoliang Li, Yap peng Tan, and Alex Kot. Benchmarking the robustness of spatial-temporal models against corruptions, 2022. URL https://arxiv.org/abs/2110.06513

  53. [53]

    Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision, 2020. URL https://arxiv.org/abs/1906.08988

  54. [54]

    Zheng Yinglin, Bao Jianmin, Chen Dong, Zeng Ming, and Wen Fang. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15044–15054, 2021

  55. [55]

    Daichi Zhang, Zihao Xiao, Shikun Li, Fanzhao Lin, Jianmin Li, and Shiming Ge. Learning natural consistency representation for face forgery video detection. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXXIII, page 407–424, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031...

  56. [56]

    Wenliang Zhao, Yongming Rao, Weikang Shi, Zuyan Liu, Jie Zhou, and Jiwen Lu. Diffswap: High-fidelity and controllable face swapping via 3d-aware masked diffusion. CVPR, 2023

  57. [57]

    Wang Zhendong, Bao Jianmin, Zhou Wengang, Wang Weilun, and Li Houqiang. Altfreezing for more general video face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4129–4138, June 2023

  58. [58]

    Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. Wilddeepfake: A challenging real-world dataset for deepfake detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2382–2390, 2020.