arxiv: 2605.17133 · v1 · pith:7X4667IAnew · submitted 2026-05-16 · 💻 cs.CV · cs.AI

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

Pith reviewed 2026-05-20 15:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video forgery detection · deepfake · cross-attention · multimodal fusion · media forensics · generative video

The pith

Cross-attention fusion of appearance, motion, and depth features detects video forgeries by exposing cross-modal contradictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that video forgery detectors should shift from single-modality analysis to cross-modal reasoning. Advanced generators create videos that look consistent in appearance or motion alone but show mismatches when the two are compared. The proposed CAM-VFD uses a cross-attention setup in which visual features query temporal and geometric ones to highlight these mismatches. The paper reports 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy on GenVideo, with stable performance under compression, noise, blur, and adversarial perturbations. If correct, this means future forensics tools must account for how different aspects of a video relate to each other rather than checking them separately.

Core claim

The authors claim that modeling cross-modal contradictions as a directional forensic signal via cross-attention fusion enables better identification of manipulated videos. Specifically, CLIP appearance representations query VideoMAE motion features and MiDaS depth features, producing attention patterns whose real and fake distributions are statistically separable (p<0.001, Cohen's d=0.68). This yields accuracies of 95.31% on GenVidBench and 93.43% on GenVideo while maintaining robustness to compression and perturbations.

What carries the argument

The cross-attention fusion mechanism that uses appearance representations as queries to motion and depth features.
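
To make the query direction concrete, here is a minimal sketch of scaled dot-product cross-attention in which appearance tokens act as queries over motion and depth tokens. It is not the authors' implementation: the random tokens stand in for projected CLIP, VideoMAE, and MiDaS features, the token counts and the dimension of 8 are arbitrary, the learned query/key/value projections and multiple heads that a real model would normally have are omitted, and fusing by concatenation is an assumption.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def cross_attention(queries, context):
    """Queries attend over context tokens; keys and values are both `context`."""
    d = len(queries[0])
    scores = matmul(queries, transpose(context))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context), weights

def random_tokens(n, d, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
appearance = random_tokens(4, 8, rng)  # hypothetical stand-in for CLIP tokens
motion = random_tokens(6, 8, rng)      # hypothetical stand-in for VideoMAE tokens
depth = random_tokens(5, 8, rng)       # hypothetical stand-in for MiDaS tokens

# Appearance is the query in both branches, matching the direction described above.
app_to_motion, w_motion = cross_attention(appearance, motion)
app_to_depth, w_depth = cross_attention(appearance, depth)

# Assumed fusion: concatenate each appearance token with its two attended views.
fused = [a + m + g for a, m, g in zip(appearance, app_to_motion, app_to_depth)]
```

Each row of `w_motion` and `w_depth` is a probability distribution over the tokens of the other modality, which is the kind of attention pattern the paper's discrepancy analysis inspects.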

If this is right

  • The approach achieves 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy with 96.56% AUROC on GenVideo.
  • It maintains performance stability under compression, noise, blur, and adversarial perturbations.
  • Cross-modal contradiction detection offers a new signal for media forensics that single-modality methods lack.
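
The robustness claim in the list above can be probed with a small perturbation harness. The sketch below is illustrative only: it covers Gaussian noise and a 3x3 box blur on grayscale frames stored as nested lists, leaves out compression and adversarial attacks, and uses a toy brightness-threshold `toy_predict` in place of a real detector. All function names and parameter values are assumptions, not the paper's evaluation protocol.

```python
import random

def add_gaussian_noise(frame, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in frame]

def box_blur(frame):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(frame), len(frame[0])
    blurred = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [frame[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred

def accuracy_under(perturb, frames, labels, predict):
    """Accuracy of `predict` after applying `perturb` to every frame."""
    hits = sum(predict(perturb(f)) == y for f, y in zip(frames, labels))
    return hits / len(frames)

def toy_predict(frame):
    """Hypothetical detector: labels a frame 1 if its mean brightness exceeds 0.5."""
    pixels = [p for row in frame for p in row]
    return int(sum(pixels) / len(pixels) > 0.5)

rng = random.Random(0)
frames = [[[0.9] * 4 for _ in range(4)], [[0.1] * 4 for _ in range(4)]] * 2
labels = [1, 0, 1, 0]

clean_acc = accuracy_under(lambda f: f, frames, labels, toy_predict)
noise_acc = accuracy_under(lambda f: add_gaussian_noise(f, 0.05, rng),
                           frames, labels, toy_predict)
blur_acc = accuracy_under(box_blur, frames, labels, toy_predict)
```

Comparing `noise_acc` and `blur_acc` against `clean_acc` gives the per-perturbation accuracy drop; a stable detector keeps that gap small as the perturbation strength grows.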

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Generators could adapt by enforcing consistency across CLIP, VideoMAE, and MiDaS features, potentially reducing the method's effectiveness.
  • The framework might apply to detecting forgeries in other multimodal domains like audio-visual content.
  • Future work could test this on larger or more diverse video datasets to confirm generalizability.

Load-bearing premise

That the cross-modal contradictions created by today's generators remain forensically useful and are effectively captured by the chosen feature extractors and attention design.

What would settle it

A generator that produces videos with enforced consistency across visual, temporal, and geometric modalities, resulting in indistinguishable attention discrepancy distributions for real and fake samples, would falsify the central advantage.

Figures

Figures reproduced from arXiv: 2605.17133 by Dalia Sobhy, Hoda Osama Elkhodary, Marwa Elshenawy, Sherin Mostafa Youssef.

Figure 1: Proposed Cross-Attention Multimodal Video Forgery Detection (CAM-VFD)
Figure 2: Multimodal feature heatmaps extracted from real and AI-generated video samples by CAM-VFD. Across all three modalities, real videos (left) exhibit strong cross-modal consistency, with coherent spatial structures, stable depth transitions, and physically plausible motion patterns reflecting underlying scene geometry. In contrast, AI-generated videos (right) expose distinctive inconsistency signatures: fragm…
Figure 3: Overview of the proposed cross-attention fusion mechanism.
Figure 4: Distribution of Cross-Modal Attention Discrepancy (CMAD) scores for real and …
Original abstract

The rapid advancement of deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions (p<0.001, Cohen's d=0.68). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.
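
For readers who want to see what the separability statistics measure, the sketch below computes Cohen's d with a pooled standard deviation and a Welch t statistic for two groups of scores. The score lists are made-up placeholders, not the paper's CMAD values, and the abstract does not say which test variant the authors used, so Welch's form is an assumption. Turning the t statistic into a p-value needs a t-distribution CDF, which is left out here.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b))
                       / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a)
                                           + variance(b) / len(b))

# Hypothetical discrepancy scores, for illustration only.
real_scores = [0.21, 0.25, 0.19, 0.30, 0.24, 0.22]
fake_scores = [0.27, 0.33, 0.24, 0.36, 0.29, 0.31]

d = cohens_d(fake_scores, real_scores)
t = welch_t(fake_scores, real_scores)
```

By Cohen's conventional thresholds (0.5 medium, 0.8 large), d=0.68 is a medium-sized effect: the group means differ clearly, but the two distributions still overlap substantially. A small p-value mainly reflects sample size, so the effect size is the more informative of the two numbers when judging how well the score separates individual videos.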

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradictions as a forensic signal. Appearance representations from CLIP serve as queries in a cross-attention fusion against VideoMAE motion features and MiDaS depth features. The work reports 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy (90.63% F1, 96.56% AUROC) on GenVideo, with statistically separable real/fake attention distributions (p<0.001, Cohen's d=0.68), robustness under compression/noise/blur/adversarial perturbations, and publicly released code.
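
The three metric types quoted in the summary answer different questions, which the following sketch illustrates on toy data. Accuracy and F1 depend on a decision threshold, whereas AUROC is threshold-free and equals the probability that a randomly chosen fake receives a higher score than a randomly chosen real video. The labels, scores, and the 0.5 threshold are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), written as 2TP / (2TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = fake, 0 = real; scores are hypothetical detector outputs.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [int(s > 0.5) for s in scores]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = auroc(y_true, scores)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a sketch; a rank-based formulation is the usual choice at benchmark scale.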

Significance. If the cross-attention mechanism specifically surfaces forensically discriminative cross-modal contradictions rather than generic domain shifts, the approach could advance multimodal video forensics beyond single-modality detectors. The public code release and reporting of p-values plus effect sizes are explicit strengths that aid reproducibility and allow direct assessment of the separability claim.

major comments (2)
  1. [Abstract and §4 (Experimental Results)] The reported accuracies and attention discrepancy statistics are given without baselines that use the same CLIP/VideoMAE/MiDaS extractors, either in single-modality mode or with the cross-attention fusion removed; this omission makes it hard to attribute the performance to the claimed cross-modal contradiction modeling.
  2. [§4.3 (cross-modal attention discrepancy analysis)] The observed separation (p<0.001, d=0.68) is consistent with generic distribution shift in the three pretrained extractors (all trained exclusively on real data) rather than directional forensic cross-modal inconsistency; no control experiment with generators that enforce cross-modal consistency is reported to isolate the asserted signal.
minor comments (2)
  1. [Abstract] The GenVidBench result is labeled 'Top-1 accuracy' while the GenVideo result uses plain 'accuracy'; adopt consistent metric terminology throughout.
  2. The manuscript would benefit from explicit statements of data splits, training hyperparameters, and optimizer choices even though code is released.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

  1. Referee: [Abstract and §4 (Experimental Results)] The reported accuracies and attention discrepancy statistics are given without baselines that use the same CLIP/VideoMAE/MiDaS extractors, either in single-modality mode or with the cross-attention fusion removed; this omission makes it hard to attribute the performance to the claimed cross-modal contradiction modeling.

    Authors: We agree that direct comparisons with single-modality baselines using the identical extractors are necessary to isolate the contribution of the cross-attention fusion. In the revised manuscript we will add these baselines: independent classifiers (MLP head) trained on CLIP appearance features alone, VideoMAE motion features alone, and MiDaS depth features alone, all evaluated on the same GenVidBench and GenVideo splits with the same metrics. These results will be reported alongside the full CAM-VFD model to quantify the performance gain attributable to cross-modal contradiction modeling. revision: yes

  2. Referee: [§4.3 (cross-modal attention discrepancy analysis)] The observed separation (p<0.001, d=0.68) is consistent with generic distribution shift in the three pretrained extractors (all trained exclusively on real data) rather than directional forensic cross-modal inconsistency; no control experiment with generators that enforce cross-modal consistency is reported to isolate the asserted signal.

    Authors: We acknowledge that the reported separation could in principle reflect generic distribution shift rather than specifically forensic cross-modal inconsistency. The cross-attention design is motivated by the hypothesis that advanced generators preserve intra-modal coherence while breaking inter-modal relations, but we agree that a control set of videos generated under explicit cross-modal consistency constraints would provide stronger isolation. No such control data currently exists in public benchmarks, and constructing it would require new generative pipelines outside the scope of the present study. In the revision we will expand §4.3 with additional discussion of this alternative explanation, include qualitative attention visualizations that illustrate modality-specific contradictions, and list the absence of consistency-controlled generators as an explicit limitation and avenue for future work. revision: partial
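
The single-modality baselines promised in the first response amount to an ablation loop over feature subsets evaluated on a fixed split. The sketch below shows one way such a harness could look. It swaps the MLP head mentioned in the rebuttal for a nearest-centroid classifier to stay dependency-free, and the one-dimensional toy features, labels, and split indices are invented for illustration; the full cross-attention model would be one more row in the results table.

```python
def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Stand-in classifier: assign each test vector to the closer class centroid."""
    c_real = centroid([x for x, y in zip(train_x, train_y) if y == 0])
    c_fake = centroid([x for x, y in zip(train_x, train_y) if y == 1])
    preds = [int(sq_dist(x, c_fake) < sq_dist(x, c_real)) for x in test_x]
    return sum(p == y for p, y in zip(preds, test_y)) / len(test_y)

def run_ablation(features, labels, train_idx, test_idx):
    """Score every modality alone, then all modalities concatenated."""
    configs = {name: [name] for name in features}
    configs["concat"] = list(features)
    results = {}
    for config, names in configs.items():
        def row(i):
            # Concatenate the selected modalities' vectors for sample i.
            return [v for n in names for v in features[n][i]]
        results[config] = nearest_centroid_accuracy(
            [row(i) for i in train_idx], [labels[i] for i in train_idx],
            [row(i) for i in test_idx], [labels[i] for i in test_idx])
    return results

# Hypothetical one-dimensional features per modality; 1 = fake, 0 = real.
features = {
    "appearance": [[0.0], [0.1], [1.0], [0.9], [0.05], [0.95]],
    "motion": [[0.5], [0.5], [0.5], [0.5], [0.5], [0.5]],
    "depth": [[0.2], [0.0], [0.8], [1.0], [0.1], [0.9]],
}
labels = [0, 0, 1, 1, 0, 1]
results = run_ablation(features, labels, train_idx=[0, 1, 2, 3], test_idx=[4, 5])
```

Keeping the split and the metric fixed across configurations is what lets any gain of the fused model be attributed to the fusion step rather than to differences in data or evaluation.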

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

The paper defines CAM-VFD as a cross-attention architecture that fuses CLIP appearance queries with VideoMAE motion and MiDaS depth features, then reports measured accuracies (95.31% on GenVidBench, 93.43% on GenVideo) and statistical separability (p<0.001, d=0.68) on independent generative video benchmarks. No equations are shown that reduce these outcomes to internal attention scores or fitted parameters by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises. The derivation chain consists of a proposed architecture plus external evaluation; the reported performance is not tautologically equivalent to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the chosen pre-trained models extract sufficiently independent signals and that cross-attention will surface generator-induced inconsistencies; no explicit free parameters, new axioms, or invented entities are stated in the abstract.



