CAM-VFD: Cross-Attention Multimodal Video Forgery Detection
Pith reviewed 2026-05-20 15:02 UTC · model grok-4.3
The pith
Cross-attention fusion of appearance, motion, and depth features detects video forgeries by exposing cross-modal contradictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that modeling cross-modal contradictions as a directional forensic signal via cross-attention fusion enables better identification of manipulated videos. Specifically, CLIP appearance representations query VideoMAE motion features and MiDaS depth features, producing attention patterns that separate real from fake distributions with statistical significance. This yields accuracies of 95.31% on GenVidBench and 93.43% on GenVideo while maintaining robustness to compression and perturbations.
What carries the argument
The cross-attention fusion mechanism that uses appearance representations as queries to motion and depth features.
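The fusion step can be pictured with a minimal sketch, assuming single-head scaled dot-product attention over features that already share one dimension. The function names (`cross_attention`, `fuse`), the toy vectors, and concatenation as the fusion rule are illustrative assumptions, not the authors' implementation. A trained model would add learned query/key/value projections and use real CLIP, VideoMAE, and MiDaS features.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted sum of values."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def fuse(appearance, motion, depth):
    """Appearance tokens query motion and depth tokens (hypothetical fusion rule)."""
    motion_ctx, motion_w = cross_attention(appearance, motion, motion)
    depth_ctx, depth_w = cross_attention(appearance, depth, depth)
    # Assumption: fuse by concatenating appearance with both attended contexts.
    fused = [a + m + g for a, m, g in zip(appearance, motion_ctx, depth_ctx)]
    return fused, motion_w, depth_w


# Toy stand-ins for per-token features (not real extractor outputs).
appearance = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
motion = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
depth = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]

fused, motion_w, depth_w = fuse(appearance, motion, depth)
print(len(fused), len(fused[0]))
```

The attention weights `motion_w` and `depth_w` are returned alongside the fused features because the paper's discrepancy analysis is described as operating on attention patterns; how those patterns are turned into a score is not specified in this summary.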
If this is right
- The approach achieves over 93% accuracy on both generative video benchmarks and 96.56% AUROC on GenVideo.
- It maintains performance stability under compression, noise, blur, and adversarial perturbations.
- Cross-modal contradiction detection offers a new signal for media forensics that single-modality methods lack.
Where Pith is reading between the lines
- Generators could adapt by enforcing consistency across CLIP, VideoMAE, and MiDaS features, potentially reducing the method's effectiveness.
- The framework might apply to detecting forgeries in other multimodal domains like audio-visual content.
- Future work could test this on larger or more diverse video datasets to confirm generalizability.
Load-bearing premise
That the cross-modal contradictions created by today's generators remain forensically useful and are effectively captured by the chosen feature extractors and attention design.
What would settle it
A generator that produces videos with enforced consistency across visual, temporal, and geometric modalities, resulting in indistinguishable attention discrepancy distributions for real and fake samples, would falsify the central advantage.
Original abstract
The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions (p<0.001, Cohen's d=0.68). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.
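The separability figures in the abstract (p<0.001, Cohen's d=0.68) can be made concrete with a small sketch of how an effect size and a test statistic are computed from two samples of discrepancy scores. The sample values are made up, and the use of Welch's t statistic is an assumption, since this summary does not state which test the authors ran. By Cohen's conventions, d=0.68 falls between the medium (0.5) and large (0.8) thresholds.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def welch_t(a, b):
    """Welch's t statistic (assumed test; unequal variances allowed)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))


# Hypothetical per-video discrepancy scores, for illustration only.
fake_scores = [2.0, 4.0, 6.0]
real_scores = [1.0, 3.0, 5.0]

print(cohens_d(fake_scores, real_scores))
print(welch_t(fake_scores, real_scores))
```

A large sample can yield a very small p-value even for a moderate d, which is why the effect size is worth reading alongside the significance level.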
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradictions as a forensic signal. Appearance representations from CLIP serve as queries in a cross-attention fusion against VideoMAE motion features and MiDaS depth features. The work reports 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy (90.63% F1, 96.56% AUROC) on GenVideo, with statistically separable real/fake attention distributions (p<0.001, Cohen's d=0.68), robustness under compression/noise/blur/adversarial perturbations, and publicly released code.
Significance. If the cross-attention mechanism specifically surfaces forensically discriminative cross-modal contradictions rather than generic domain shifts, the approach could advance multimodal video forensics beyond single-modality detectors. The public code release and reporting of p-values plus effect sizes are explicit strengths that aid reproducibility and allow direct assessment of the separability claim.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experimental Results): the reported accuracies and attention discrepancy statistics are given without baseline comparisons that use the same CLIP/VideoMAE/MiDaS extractors in single-modality mode or without the cross-attention fusion; this omission directly affects the ability to attribute performance to the claimed cross-modal contradiction modeling.
- [§4.3] §4.3 (cross-modal attention discrepancy analysis): the observed separation (p<0.001, d=0.68) is consistent with generic distribution shift in the three pretrained extractors (all trained exclusively on real data) rather than directional forensic cross-modal inconsistency; no control experiment with generators that enforce cross-modal consistency is reported to isolate the asserted signal.
minor comments (2)
- [Abstract] Abstract: GenVidBench result is labeled 'Top-1 accuracy' while GenVideo uses plain 'accuracy'; adopt consistent metric terminology throughout.
- The manuscript would benefit from explicit statements of data splits, training hyperparameters, and optimizer choices even though code is released.
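As a reference point for the metric terminology raised above, here is a minimal sketch of accuracy, F1, and AUROC for a binary real/fake task, written from the standard definitions. The labels, scores, and 0.5 threshold are made-up illustrations, not values from the paper. Top-1 accuracy is the accuracy of the single highest-scoring predicted class, so for single-label classification it coincides with plain accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fake, 0 = real) and detector scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auroc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, while AUROC is computed from the raw scores, which is one reason a paper should state the threshold when reporting all three together.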
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the reported accuracies and attention discrepancy statistics are given without baseline comparisons that use the same CLIP/VideoMAE/MiDaS extractors in single-modality mode or without the cross-attention fusion; this omission directly affects the ability to attribute performance to the claimed cross-modal contradiction modeling.
  Authors: We agree that direct comparisons with single-modality baselines using the identical extractors are necessary to isolate the contribution of the cross-attention fusion. In the revised manuscript we will add these baselines: independent classifiers (MLP head) trained on CLIP appearance features alone, VideoMAE motion features alone, and MiDaS depth features alone, all evaluated on the same GenVidBench and GenVideo splits with the same metrics. These results will be reported alongside the full CAM-VFD model to quantify the performance gain attributable to cross-modal contradiction modeling. Revision: yes.
- Referee: [§4.3] §4.3 (cross-modal attention discrepancy analysis): the observed separation (p<0.001, d=0.68) is consistent with generic distribution shift in the three pretrained extractors (all trained exclusively on real data) rather than directional forensic cross-modal inconsistency; no control experiment with generators that enforce cross-modal consistency is reported to isolate the asserted signal.
  Authors: We acknowledge that the reported separation could in principle reflect generic distribution shift rather than specifically forensic cross-modal inconsistency. The cross-attention design is motivated by the hypothesis that advanced generators preserve intra-modal coherence while breaking inter-modal relations, but we agree that a control set of videos generated under explicit cross-modal consistency constraints would provide stronger isolation. No such control data currently exists in public benchmarks, and constructing it would require new generative pipelines outside the scope of the present study. In the revision we will expand §4.3 with additional discussion of this alternative explanation, include qualitative attention visualizations that illustrate modality-specific contradictions, and list the absence of consistency-controlled generators as an explicit limitation and avenue for future work. Revision: partial.
Circularity Check
No circularity: empirical results on external benchmarks
The paper defines CAM-VFD as a cross-attention architecture that fuses CLIP appearance queries with VideoMAE motion and MiDaS depth features, then reports measured accuracies (95.31% on GenVidBench, 93.43% on GenVideo) and statistical separability (p<0.001, d=0.68) on independent generative video benchmarks. No equations are shown that reduce these outcomes to internal attention scores or fitted parameters by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises. The derivation chain consists of a proposed architecture plus external evaluation; the reported performance is not tautologically equivalent to the inputs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Cross-Modal Attention Discrepancy (CMAD) ... p<0.001, Cohen's d=0.68"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.