SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

Huixuaun Zhang; Xiaojun Wan; Zixi Wei

SpecSem-Net detects high-fidelity AI videos by guiding spectral denoising with semantic context.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-20 14:15 UTC pith:ZXT2QP6P

Load-bearing objection: SpecSem-Net adds semantic-guided spectral filtering and gated fusion for AI video detection plus a new 5-generator benchmark, but the gains rest on unverified persistence of high-frequency artifacts in models like Sora.

arXiv 2605.17311 v1 · pith:ZXT2QP6P · submitted 2026-05-17 · cs.CV

classification cs.CV

keywords: AI-generated video detection · spectral features · semantic features · Fourier transform · deepfake detection · video forensics · generative models · gated fusion

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SpecSem-Net to detect videos created by advanced generative models such as Sora and Veo, which now produce content that looks realistic enough to fool many existing detectors. Current methods fail because they depend heavily on semantic features that these models have learned to match closely. SpecSem-Net instead extracts high-frequency spectral components through Fourier-based filtering and uses a gated mechanism to blend those components adaptively with semantic information, thereby reducing errors from isolated spectral noise. The authors support this by constructing a benchmark that includes outputs from five leading commercial generators and report higher accuracy than prior approaches on both that benchmark and public datasets. If the approach holds, detectors could continue to separate synthetic videos from authentic ones even as generator quality improves.

Core claim

SpecSem-Net is the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. It extracts high-frequency features via a Fourier-Transform based spectral module and employs a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. On a new benchmark with five state-of-the-art commercial generators the method reaches 87.25 percent accuracy, and it reaches 95.59 percent on public datasets, outperforming existing detectors.

What carries the argument

A semantic-guided spectral denoising mechanism: Fourier-Transform filtering extracts high-frequency features, and gated merging then fuses those features with semantic context while suppressing noise.
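
To make the spectral half of that mechanism concrete, here is a minimal NumPy sketch of FFT-based high-pass filtering on a single grayscale frame. It is an illustration under stated assumptions, not the authors' module: the function name `high_pass_residual`, the brick-wall mask, and the `cutoff` value of 0.25 cycles per pixel are all hypothetical, since the material reviewed here does not give the paper's filter shape or cutoff.

```python
import numpy as np


def high_pass_residual(frame: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Return the high-frequency residual of a 2-D grayscale frame.

    `cutoff` is a hypothetical radius in normalized frequency units
    (cycles per pixel, so 0.5 is the per-axis Nyquist limit); the paper's
    actual cutoff is not stated in the material reviewed here.
    """
    h, w = frame.shape
    # Per-axis frequencies in cycles/pixel, laid out like the output of fft2.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)

    # Ideal (brick-wall) high-pass mask: keep components at or beyond the cutoff.
    mask = radius >= cutoff

    spectrum = np.fft.fft2(frame)
    residual = np.fft.ifft2(spectrum * mask)
    # The mask is symmetric under frequency negation, so for a real input the
    # result is real up to floating-point error.
    return residual.real
```

A constant frame carries only a DC component and so yields a zero residual, while a one-pixel checkerboard sits at the Nyquist frequency and passes through unchanged. A brick-wall mask is the simplest choice and can introduce ringing; a smooth roll-off would be a reasonable alternative.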

Load-bearing premise

High-frequency spectral artifacts remain reliably present and distinguishable even in videos produced by the latest commercial generators such as Sora and Veo.

What would settle it

Generate a test set of videos from a model that explicitly suppresses or randomizes high-frequency spectral content, then check whether SpecSem-Net accuracy falls to the level of semantic-only detectors.
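
A cheap proxy for this falsifier can be sketched in NumPy by low-pass filtering the generated videos after the fact and measuring how much accuracy a detector loses. This is only an approximation under stated assumptions: post-hoc filtering is not the same as a generator that natively suppresses high frequencies, and `detector`, `suppress_high_frequencies`, `falsifier_gap`, and the 0.25 cutoff are hypothetical names and values, not anything from the paper.

```python
import numpy as np


def suppress_high_frequencies(video: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Low-pass every frame of a (T, H, W) grayscale video."""
    _, h, w = video.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    keep = np.sqrt(fx**2 + fy**2) < cutoff  # (H, W) mask, broadcast over time
    spectrum = np.fft.fft2(video, axes=(-2, -1))
    return np.fft.ifft2(spectrum * keep, axes=(-2, -1)).real


def accuracy(detector, videos, labels) -> float:
    """Fraction of videos whose predicted label (0 = real, 1 = generated) is correct."""
    predictions = [detector(v) for v in videos]
    return float(np.mean([p == y for p, y in zip(predictions, labels)]))


def falsifier_gap(detector, videos, labels, cutoff: float = 0.25) -> float:
    """Accuracy on the original set minus accuracy after filtering generated videos.

    Only videos labelled 1 (generated) are filtered, mimicking a generator
    that suppresses high-frequency content; real videos are left untouched.
    """
    filtered = [
        suppress_high_frequencies(v, cutoff) if y == 1 else v
        for v, y in zip(videos, labels)
    ]
    return accuracy(detector, videos, labels) - accuracy(detector, filtered, labels)
```

A large positive gap would suggest the detector leans on high-frequency cues; a gap near zero would suggest it is effectively behaving like a semantic-only classifier on that data.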

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

SpecSem-Net adds semantic-guided spectral filtering and gated fusion for AI video detection plus a new 5-generator benchmark, but the gains rest on unverified persistence of high-frequency artifacts in models like Sora.

This paper proposes SpecSem-Net, which extracts high-frequency features through Fourier filtering, then uses semantic context to guide denoising and a gated merge to combine the two streams. The authors also assemble a benchmark of videos from five recent commercial generators and report 87.25% accuracy on it and 95.59% on public datasets. That architecture and the benchmark are the concrete additions here.

The gated mechanism, which suppresses spectral noise while keeping semantic cues, is a reasonable engineering step that could reduce some misclassifications in practice. Testing against current top-tier generators is useful because older detectors are known to degrade on high-fidelity output. The work engages directly with the detection problem and cites relevant prior spectral and semantic approaches without obvious circularity in the architecture.

The central soft spot is the assumption that reliable high-frequency artifacts still exist and can be isolated in the latest generators. If Sora, Veo, or similar models have already reduced or randomized those cues through training changes or post-processing, the spectral branch contributes little beyond what a semantic classifier already does, and the benchmark numbers may reflect dataset construction more than a general denoising benefit. The abstract states performance but supplies limited protocol details on baselines, splits, or ablations, which leaves the size of the actual improvement unclear.

This is the kind of paper that matters for media forensics and disinformation work. Readers building or evaluating detectors would get value from the benchmark and the fusion idea even if they adapt the spectral part. The paper is coherent on its own terms and deserves a serious referee to check artifact survival, run fuller comparisons, and verify the experimental controls. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpecSem-Net, a framework that extracts high-frequency features via Fourier-Transform based filtering in a spectral module and adaptively fuses them with semantic context using a Gated Merging Mechanism for detecting AI-generated videos. It constructs a new benchmark with videos from 5 SOTA commercial generators (including Sora and Veo) and reports accuracies of 87.25% on this benchmark and 95.59% on public datasets, claiming to be the first to introduce a semantic-guided spectral denoising mechanism for high-fidelity video detection.

Significance. If the empirical results hold under rigorous verification, the work would contribute a practical detector that addresses the failure modes of purely semantic approaches as generative models improve in visual fidelity. The new benchmark covering latest commercial generators is a useful resource for the community. The architectural idea of gating spectral features with semantic context is a reasonable direction, though its advantage depends on the continued presence of detectable high-frequency artifacts.

major comments (2)

[Abstract and §4 (Experimental Setup)] The reported accuracies of 87.25% on the new benchmark and 95.59% on public datasets are presented without any description of the experimental protocol, number of samples per generator, train/test splits, baseline methods, or statistical significance tests. This information is load-bearing for the central claim that SpecSem-Net outperforms existing methods on high-fidelity generators.
[§3.2 (Spectral Module) and §5 (Results on Commercial Generators)] The claim that the Fourier-based filtering reliably extracts distinguishable high-frequency artifacts rests on the untested assumption that such artifacts survive in videos from Sora and Veo. No ablation or visualization is provided showing that the spectral branch still contributes when these latest models are used; if the artifacts have been suppressed, the gated fusion reduces to a standard semantic classifier and the reported gains may reflect benchmark construction rather than the proposed mechanism.

minor comments (2)

[§3.3] Notation for the Gated Merging Mechanism should be defined with explicit equations rather than descriptive text only.
[Figure 3] Figure captions for spectral visualizations should include the exact frequency cutoff values used in the Fourier filtering.
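
As an illustration of the kind of explicit notation these two minor comments ask for, one generic way to write a cutoff-parameterized high-pass filter and a semantic-conditioned gate is sketched below. Every symbol here is an assumption introduced for illustration ($r_c$ for the cutoff radius, $\mathbf{f}_{\mathrm{sem}}$ and $\mathbf{f}_{\mathrm{spec}}$ for the two feature streams, $\mathbf{W}_g$ and $\mathbf{b}_g$ for learned gate parameters); the paper's own definitions may differ.

```latex
% High-pass filtering of a frame X with a radial cutoff r_c (hypothetical form)
M(u, v) = \mathbb{1}\!\left[\sqrt{u^2 + v^2} \ge r_c\right], \qquad
X_{\mathrm{hp}} = \mathcal{F}^{-1}\!\big(M \odot \mathcal{F}(X)\big)

% Sigmoid gate computed from semantic features, applied to spectral features
\mathbf{g} = \sigma\!\left(\mathbf{W}_g \mathbf{f}_{\mathrm{sem}} + \mathbf{b}_g\right), \qquad
\tilde{\mathbf{f}}_{\mathrm{spec}} = \mathbf{g} \odot \mathbf{f}_{\mathrm{spec}}
```

Written this way, the caption request reduces to reporting the value of $r_c$, and the gate's behavior on benign high-frequency content can be read off from where $\mathbf{g}$ approaches zero.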

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have reviewed each major point carefully and provide point-by-point responses below. We agree that additional details and analyses will strengthen the paper and will incorporate revisions accordingly.

Referee: [Abstract and §4 (Experimental Setup)] The reported accuracies of 87.25% on the new benchmark and 95.59% on public datasets are presented without any description of the experimental protocol, number of samples per generator, train/test splits, baseline methods, or statistical significance tests. This information is load-bearing for the central claim that SpecSem-Net outperforms existing methods on high-fidelity generators.

Authors: We agree that a complete description of the experimental protocol is necessary to support our claims and ensure reproducibility. In the revised manuscript, we will expand §4 to explicitly detail the number of samples per generator in the new benchmark, the train/test split methodology and ratios, the complete list of baseline methods with implementation references, and the results of statistical significance tests (such as McNemar's test or paired t-tests with p-values) comparing SpecSem-Net against the baselines. These elements were part of our experimental design but were not fully elaborated in the original submission; we will now include them.

revision: yes

Referee: [§3.2 (Spectral Module) and §5 (Results on Commercial Generators)] The claim that the Fourier-based filtering reliably extracts distinguishable high-frequency artifacts rests on the untested assumption that such artifacts survive in videos from Sora and Veo. No ablation or visualization is provided showing that the spectral branch still contributes when these latest models are used; if the artifacts have been suppressed, the gated fusion reduces to a standard semantic classifier and the reported gains may reflect benchmark construction rather than the proposed mechanism.

Authors: We acknowledge the validity of this observation. The contribution of the spectral module on the latest high-fidelity generators requires explicit verification. In the revised manuscript, we will add an ablation study in §5 that isolates the performance of the full SpecSem-Net model versus a semantic-only variant on the commercial generators benchmark. We will also include visualizations of the frequency spectra and filtered features for representative samples from Sora and Veo to demonstrate that the high-frequency branch continues to provide distinguishable information. These additions will clarify the role of the gated merging mechanism.

revision: yes
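
The significance testing mentioned in the first response can be illustrated with an exact McNemar test, which compares two detectors evaluated on the same test videos by looking only at the samples where they disagree. The sketch below uses the Python standard library; the function name `mcnemar_exact` and the per-sample boolean inputs are assumptions for illustration and say nothing about how the authors will actually run their tests.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test for two classifiers on the same test set.

    `correct_a[i]` and `correct_b[i]` say whether each classifier got sample i
    right. Returns (b, c, p_value), where b counts samples only A got right
    and c counts samples only B got right.
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if bb and not a)
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5);
    # double the smaller tail for a two-sided p-value, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2.0 * tail)
```

For example, if one detector is right on eight videos that the other misses and the reverse never happens, the two-sided p-value is 2/256, about 0.0078. Because the test uses paired outcomes, it needs per-video predictions from every baseline, not just aggregate accuracies.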

Circularity Check

0 steps flagged

No circularity detected in architectural proposal or empirical claims

The paper presents SpecSem-Net as a new neural network architecture that extracts high-frequency features via Fourier-Transform filtering and fuses them with semantic context using a Gated Merging Mechanism. No equations, derivations, or first-principles results are described that reduce to fitted parameters or inputs by construction. Performance numbers (87.25% on custom benchmark, 95.59% on public datasets) are reported from direct empirical evaluation rather than any self-referential prediction. The 'first framework' claim and benchmark construction do not invoke self-citations or uniqueness theorems that would create a load-bearing circular chain. The work is self-contained as an empirical architecture proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard signal-processing and deep-learning assumptions plus the domain premise that spectral artifacts persist in high-fidelity generators; no new entities are postulated.

free parameters (1)

neural network hyperparameters and gating thresholds
Typical trainable or hand-chosen parameters in any deep architecture; not enumerated in the abstract.

axioms (1)

domain assumption: Detectable high-frequency spectral artifacts exist in outputs of current commercial video generators.
Invoked to justify the Fourier filtering module as described in the abstract.

pith-pipeline@v0.9.0 · 5715 in / 1133 out tokens · 53601 ms · 2026-05-20T14:15:44.157051+00:00 · methodology

Original abstract

The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.

Figures

Figures reproduced from arXiv: 2605.17311 by Huixuaun Zhang, Xiaojun Wan, Zixi Wei.

**Figure 1.** Overview of the proposed SpecSem-Net. (a) The overall dual-stream architecture, comprising a fixed Semantic Branch (Blue) and a trainable Spectral Branch (Green). (b) The Gated Merging Mechanism uses semantic features to dynamically modulate spectral features, filtering out benign environmental noise. (c) The Spectral Feature Extraction module extracts high-frequency residuals via FFT-based high-pass filte…

**Figure 2.** Visualization of the Spark Case. We visualize the feature evolution to demonstrate robustness against environmental noise. (Col 1-2) The high-pass filter inherently captures benign sparks as dominant high-frequency signals. (Col 3) Consequently, the features before the gating mechanism are heavily distracted by this noise. (Col 4) The Gated Merging Mechanism identifies and down-weights these benign texture…

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Detect Early, Escalate Rarely: Anytime Detection of AI-Generated Video from the Compressed Bitstream
cs.CV 2026-07 conditional novelty 6.0

A streaming detector that reads codec motion vectors from the compressed bitstream achieves anytime-valid false-positive control with a single fixed threshold and a priced deferral frontier at about 10^5 MACs per GOP.

