
arxiv: 2605.17311 · v1 · pith:ZXT2QP6Pnew · submitted 2026-05-17 · 💻 cs.CV

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

Pith reviewed 2026-05-20 14:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated video detection, spectral features, semantic features, Fourier transform, deepfake detection, video forensics, generative models, gated fusion

The pith

SpecSem-Net detects high-fidelity AI videos by guiding spectral denoising with semantic context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SpecSem-Net to detect videos created by advanced generative models such as Sora and Veo, which now produce content that looks realistic enough to fool many existing detectors. Current methods fail because they depend heavily on semantic features that these models have learned to match closely. SpecSem-Net instead extracts high-frequency spectral components through Fourier-based filtering and uses a gated mechanism to blend those components adaptively with semantic information, thereby reducing errors from isolated spectral noise. The authors support this by constructing a benchmark that includes outputs from five leading commercial generators and report higher accuracy than prior approaches on both that benchmark and public datasets. If the approach holds, detectors could continue to separate synthetic videos from authentic ones even as generator quality improves.

Core claim

SpecSem-Net is the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. It extracts high-frequency features via a Fourier-Transform based spectral module and employs a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. On a new benchmark with five state-of-the-art commercial generators the method reaches 87.25 percent accuracy, and it reaches 95.59 percent on public datasets, outperforming existing detectors.
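
The Fourier-based high-pass step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `high_pass_residual`, the radial mask, and the `cutoff` value of 0.1 cycles per pixel are assumptions chosen for illustration, since the page does not state the filter shape or cutoff.

```python
import numpy as np

def high_pass_residual(frame: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Return the high-frequency residual of a 2-D grayscale frame.

    `cutoff` is a hypothetical radial threshold in cycles per pixel
    (the Nyquist limit along each axis is 0.5).
    """
    h, w = frame.shape
    spectrum = np.fft.fft2(frame)
    # Frequency coordinate of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Keep only bins whose radial frequency exceeds the cutoff.
    keep = np.hypot(fx, fy) > cutoff
    # The mask is symmetric, so the inverse transform is real up to rounding.
    return np.fft.ifft2(spectrum * keep).real
```

A flat frame has no high-frequency content, so its residual is zero, while a one-pixel checkerboard sits at the Nyquist frequency and passes through unchanged.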

What carries the argument

The semantic-guided spectral denoising mechanism that extracts high-frequency features via Fourier-Transform filtering and then uses gated merging to fuse those features with semantic context while suppressing noise.
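
One common way to realize this kind of gating is a sigmoid gate computed from the semantic features and applied elementwise to the spectral features. The sketch below assumes that generic form; the paper's exact equations are not reproduced on this page, and `gated_merge`, `w_gate`, and `b_gate` are hypothetical names.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_merge(semantic: np.ndarray, spectral: np.ndarray,
                w_gate: np.ndarray, b_gate: np.ndarray) -> np.ndarray:
    """Fuse semantic and spectral features with a semantic-driven gate.

    semantic: (batch, d_sem), spectral: (batch, d_spec),
    w_gate: (d_sem, d_spec), b_gate: (d_spec,). All hypothetical shapes.
    """
    # Gate values lie in (0, 1) and depend only on the semantic context.
    gate = sigmoid(semantic @ w_gate + b_gate)
    # Spectral channels the gate judges to be benign noise are scaled down.
    gated_spectral = gate * spectral
    # Concatenation is one simple fusion choice among several.
    return np.concatenate([semantic, gated_spectral], axis=-1)
```

With a strongly negative bias the gate closes and the fused vector carries almost no spectral signal, which is the failure case the referee report raises: the model then behaves like a semantic-only classifier.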

Load-bearing premise

High-frequency spectral artifacts remain reliably present and distinguishable even in videos produced by the latest commercial generators such as Sora and Veo.

What would settle it

Generate a test set of videos from a model that explicitly suppresses or randomizes high-frequency spectral content, then check whether SpecSem-Net accuracy falls to the level of semantic-only detectors.
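
A cheap proxy for this test, short of retraining a generator, is to low-pass existing frames and confirm that the high-frequency energy is gone before re-running the detector. The sketch below covers only that perturbation and measurement; the function names and the 0.1 cycles-per-pixel cutoff are assumptions, and it does not reproduce any experiment from the paper.

```python
import numpy as np

def _radial_freq(shape: tuple) -> np.ndarray:
    """Radial frequency of each FFT bin, in cycles per pixel."""
    h, w = shape
    return np.hypot(np.fft.fftfreq(w)[None, :], np.fft.fftfreq(h)[:, None])

def suppress_high_freq(frame: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Zero every frequency bin above the (hypothetical) cutoff."""
    low = _radial_freq(frame.shape) <= cutoff
    return np.fft.ifft2(np.fft.fft2(frame) * low).real

def high_freq_energy_fraction(frame: np.ndarray, cutoff: float = 0.1) -> float:
    """Share of spectral power above the cutoff (frame must be non-zero)."""
    power = np.abs(np.fft.fft2(frame)) ** 2
    high = _radial_freq(frame.shape) > cutoff
    return float(power[high].sum() / power.sum())
```

If detector accuracy on the filtered clips drops to the semantic-only level, that would support the load-bearing premise; if it does not drop, the gains probably come from elsewhere.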

Figures

Figures reproduced from arXiv: 2605.17311 by Huixuaun Zhang, Xiaojun Wan, Zixi Wei.

Figure 1: Overview of the proposed SpecSem-Net. (a) The overall dual-stream architecture, comprising a fixed Semantic Branch (Blue) and a trainable Spectral Branch (Green). (b) The Gated Merging Mechanism uses semantic features to dynamically modulate spectral features, filtering out benign environmental noise. (c) The Spectral Feature Extraction module extracts high-frequency residuals via FFT-based high-pass filte…
Figure 2: Visualization of the Spark Case. We visualize the feature evolution to demonstrate robustness against environmental noise. (Col 1-2) The high-pass filter inherently captures benign sparks as dominant high-frequency signals. (Col 3) Consequently, the features before the gating mechanism are heavily distracted by this noise. (Col 4) The Gated Merging Mechanism identifies and down-weights these benign texture…
original abstract

The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpecSem-Net, a framework that extracts high-frequency features via Fourier-Transform based filtering in a spectral module and adaptively fuses them with semantic context using a Gated Merging Mechanism for detecting AI-generated videos. It constructs a new benchmark with videos from 5 SOTA commercial generators (including Sora and Veo) and reports accuracies of 87.25% on this benchmark and 95.59% on public datasets, claiming to be the first to introduce a semantic-guided spectral denoising mechanism for high-fidelity video detection.

Significance. If the empirical results hold under rigorous verification, the work would contribute a practical detector that addresses the failure modes of purely semantic approaches as generative models improve in visual fidelity. The new benchmark covering latest commercial generators is a useful resource for the community. The architectural idea of gating spectral features with semantic context is a reasonable direction, though its advantage depends on the continued presence of detectable high-frequency artifacts.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Setup): the reported accuracies of 87.25% on the new benchmark and 95.59% on public datasets are presented without any description of the experimental protocol, number of samples per generator, train/test splits, baseline methods, or statistical significance tests. This information is load-bearing for the central claim that SpecSem-Net outperforms existing methods on high-fidelity generators.
  2. [§3.2 and §5] §3.2 (Spectral Module) and §5 (Results on Commercial Generators): the claim that the Fourier-based filtering reliably extracts distinguishable high-frequency artifacts rests on the untested assumption that such artifacts survive in videos from Sora and Veo. No ablation or visualization is provided showing that the spectral branch still contributes when these latest models are used; if the artifacts have been suppressed, the gated fusion reduces to a standard semantic classifier and the reported gains may reflect benchmark construction rather than the proposed mechanism.
minor comments (2)
  1. [§3.3] Notation for the Gated Merging Mechanism should be defined with explicit equations rather than descriptive text only.
  2. [Figure 3] Figure captions for spectral visualizations should include the exact frequency cutoff values used in the Fourier filtering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have reviewed each major point carefully and provide point-by-point responses below. We agree that additional details and analyses will strengthen the paper and will incorporate revisions accordingly.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Setup): the reported accuracies of 87.25% on the new benchmark and 95.59% on public datasets are presented without any description of the experimental protocol, number of samples per generator, train/test splits, baseline methods, or statistical significance tests. This information is load-bearing for the central claim that SpecSem-Net outperforms existing methods on high-fidelity generators.

    Authors: We agree that a complete description of the experimental protocol is necessary to support our claims and ensure reproducibility. In the revised manuscript, we will expand §4 to explicitly detail the number of samples per generator in the new benchmark, the train/test split methodology and ratios, the complete list of baseline methods with implementation references, and the results of statistical significance tests (such as McNemar's test or paired t-tests with p-values) comparing SpecSem-Net against the baselines. These elements were part of our experimental design but were not fully elaborated in the original submission; we will now include them.
    revision: yes

  2. Referee: [§3.2 and §5] §3.2 (Spectral Module) and §5 (Results on Commercial Generators): the claim that the Fourier-based filtering reliably extracts distinguishable high-frequency artifacts rests on the untested assumption that such artifacts survive in videos from Sora and Veo. No ablation or visualization is provided showing that the spectral branch still contributes when these latest models are used; if the artifacts have been suppressed, the gated fusion reduces to a standard semantic classifier and the reported gains may reflect benchmark construction rather than the proposed mechanism.

    Authors: We acknowledge the validity of this observation. The contribution of the spectral module on the latest high-fidelity generators requires explicit verification. In the revised manuscript, we will add an ablation study in §5 that isolates the performance of the full SpecSem-Net model versus a semantic-only variant on the commercial generators benchmark. We will also include visualizations of the frequency spectra and filtered features for representative samples from Sora and Veo to demonstrate that the high-frequency branch continues to provide distinguishable information. These additions will clarify the role of the gated merging mechanism.
    revision: yes
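
The first response names McNemar's test as one option for comparing SpecSem-Net with a baseline on the same test videos. A minimal exact (binomial) version can be sketched with the standard library alone. The function name `mcnemar_exact` and the disagreement counts `b` and `c` are hypothetical inputs for illustration, not figures from the paper.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired classifier outputs.

    b: videos model A classified correctly and model B incorrectly.
    c: videos model B classified correctly and model A incorrectly.
    Videos on which both models agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each disagreement is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Only the discordant pairs matter here, so the ablation proposed in the second response (full model versus semantic-only variant) could be assessed the same way on the commercial-generator benchmark.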

Circularity Check

0 steps flagged

No circularity detected in architectural proposal or empirical claims

full rationale

The paper presents SpecSem-Net as a new neural network architecture that extracts high-frequency features via Fourier-Transform filtering and fuses them with semantic context using a Gated Merging Mechanism. No equations, derivations, or first-principles results are described that reduce to fitted parameters or inputs by construction. Performance numbers (87.25% on custom benchmark, 95.59% on public datasets) are reported from direct empirical evaluation rather than any self-referential prediction. The 'first framework' claim and benchmark construction do not invoke self-citations or uniqueness theorems that would create a load-bearing circular chain. The work is self-contained as an empirical architecture proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard signal-processing and deep-learning assumptions plus the domain premise that spectral artifacts persist in high-fidelity generators; no new entities are postulated.

free parameters (1)
  • neural network hyperparameters and gating thresholds
    Typical trainable or hand-chosen parameters in any deep architecture; not enumerated in the abstract.
axioms (1)
  • domain assumption: Detectable high-frequency spectral artifacts exist in outputs of current commercial video generators
    Invoked to justify the Fourier filtering module as described in the abstract.


