Recognition: no theorem link
VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
Pith reviewed 2026-05-16 21:43 UTC · model grok-4.3
The pith
State-of-the-art VLMs cannot reliably detect AI-generated ASMR videos, though humans still can.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Today's video generation models can produce ASMR videos that are difficult for VLMs to distinguish from real ones, while humans can still identify them relatively easily. This holds across a diverse collection of 1,500 high-quality real ASMR videos and 2,235 synthetic counterparts, with an open suite of prompts and reference images that allows the benchmark to grow with new models.
What carries the argument
VideoASMR-Bench, a paired dataset of real and AI-generated ASMR videos that tests fine-grained audio-visual perception and sensory immersion rather than broad semantics.
If this is right
- Video generators are now capable of producing immersive sensory content that evades current VLM detectors.
- Detection methods must shift focus from coarse inconsistencies to low-level audio-visual details.
- The adversarial understanding-generation loop provides a practical way to keep improving both generators and detectors together.
- Existing video benchmarks lack the resolution needed to evaluate fine perceptual fidelity in specialized domains like ASMR.
Where Pith is reading between the lines
- Detection gaps may stem from training data that under-represents quiet, detail-oriented sensory scenes.
- Extending the benchmark to other immersive categories such as guided meditation or slow-motion nature footage could reveal similar weaknesses.
- If generator quality keeps rising, the human advantage in spotting fakes may narrow unless new training signals are introduced.
Load-bearing premise
The chosen real ASMR videos and the prompts used to generate their synthetic matches fairly represent the range of subtle artifacts without favoring either side through curation bias.
What would settle it
A new VLM that reaches high detection accuracy on the full VideoASMR-Bench set while preserving strong performance on standard video understanding benchmarks would falsify the claim that current models cannot reliably detect these artifacts.
Figures
read the original abstract
With AI-generated videos increasingly indistinguishable from reality, current benchmarks primarily focus on broad semantic alignment and basic physical consistency, offering limited discriminative power for evaluating them. To address this, we introduce VideoASMR-Bench, a benchmark based on Autonomous Sensory Meridian Response (ASMR) videos that emphasizes fine-grained audio-visual perception and sensory immersion. This benchmark aims to answer two key questions: (i) Are today's video understanding models (VLMs) sensitive enough to detect AI-generated ASMR videos by recognizing minor visual, physical, or auditory artifacts? (ii) Can today's video generation models (VGMs) produce convincing ASMR videos with immersive experiences? This benchmark comprises a diverse set of 1,500 high-quality real ASMR videos curated from social media, alongside 2,235 synthetic counterparts generated by nine VGMs. Additionally, we open-source an extensible suite of prompts and reference images, enabling the benchmark to scale dynamically with future video models. Moreover, we design an automatic understanding-generation evaluation framework between VGMs and VLMs, where VGMs aim to produce realistic fake videos to fool the VLMs, while the VLMs seek to detect them, forming an adversarial game between the two parties. Our evaluation on VideoASMR-Bench reveals that even state-of-the-art VLMs, such as Gemini-3-Pro, fail to reliably detect AI-generated ASMR videos. Meanwhile, current frontier video generation models can produce ASMR videos that are difficult for VLMs to distinguish from real ones, while humans can still identify them relatively easily.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoASMR-Bench, a benchmark dataset comprising 1,500 high-quality real ASMR videos curated from social media and 2,235 synthetic videos generated by nine video generation models (VGMs). It establishes an adversarial evaluation framework pitting VGMs against video understanding models (VLMs) to assess whether current VLMs can detect subtle artifacts in AI-generated ASMR content and whether VGMs can produce immersive, convincing ASMR videos. The key findings indicate that even advanced VLMs like Gemini-3-Pro struggle to reliably distinguish generated videos from real ones, whereas humans can identify them more easily, and the work provides an open-source extensible prompt suite.
Significance. Should the results be substantiated with detailed metrics and controls, this benchmark would offer significant value by shifting focus from coarse semantic alignment to fine-grained audio-visual and sensory perception in AI video evaluation. It could inform the development of more robust VLMs and realistic VGMs, particularly for niche but perceptually demanding content like ASMR. The adversarial setup and open-sourced elements promote ongoing evaluation as models advance.
major comments (3)
- [Abstract] The abstract reports that SOTA VLMs fail to reliably detect AI-generated ASMR videos but provides no evaluation metrics, per-model breakdowns, statistical tests, or specifics on how detection accuracy was measured, undermining assessment of the central empirical claim.
- [Evaluation] The description of the automatic understanding-generation evaluation framework lacks details on the exact task setup, prompt usage in the adversarial game, and quantitative controls for curation bias in selecting real videos or refining generation prompts.
- [Dataset Construction] No information is given on inter-annotator agreement for curating the 1,500 real videos or metrics for prompt diversity in the 2,235 generated samples, which is critical given the skeptic concern that selection bias may artificially favor VGM performance.
minor comments (2)
- [Abstract] Clarify the exact version or name of the VLM referred to as 'Gemini-3-Pro' for reproducibility.
- [Overall] Include sample frames or links to example videos in the paper to illustrate the fine-grained artifacts discussed.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details on metrics, evaluation protocols, and dataset curation.
read point-by-point responses
-
Referee: [Abstract] The abstract reports that SOTA VLMs fail to reliably detect AI-generated ASMR videos but provides no evaluation metrics, per-model breakdowns, statistical tests, or specifics on how detection accuracy was measured, undermining assessment of the central empirical claim.
Authors: We agree that the abstract would benefit from concise quantitative support for the central claim. The full manuscript reports per-model detection accuracies (e.g., Gemini-3-Pro at 52.3% accuracy) along with statistical significance tests in Section 4 and Table 2. In revision we will add a brief summary of key accuracy figures and the binary classification protocol to the abstract while preserving its length. revision: yes
-
Referee: [Evaluation] The description of the automatic understanding-generation evaluation framework lacks details on the exact task setup, prompt usage in the adversarial game, and quantitative controls for curation bias in selecting real videos or refining generation prompts.
Authors: Section 3.2 describes the adversarial setup in which VLMs perform binary real/fake classification and VGMs generate videos to maximize fooling rate. We will expand this section with the exact prompt templates used for both parties, an example interaction trace, and quantitative controls including prompt lexical diversity scores and category-balance statistics for the real-video curation process. revision: yes
-
Referee: [Dataset Construction] No information is given on inter-annotator agreement for curating the 1,500 real videos or metrics for prompt diversity in the 2,235 generated samples, which is critical given the skeptic concern that selection bias may artificially favor VGM performance.
Authors: We will add a dedicated paragraph in the dataset section reporting inter-annotator agreement (Cohen’s kappa) for the real-video curation and prompt-diversity metrics (unique n-gram coverage and semantic variance) for the generated set. These additions directly address selection-bias concerns. revision: yes
Circularity Check
No significant circularity: purely empirical benchmark with direct observations
full rationale
The paper constructs VideoASMR-Bench by curating 1,500 real ASMR videos from social media and generating 2,235 synthetic videos using nine VGMs, then evaluates detection by VLMs and humans. No derivations, equations, fitted parameters, or predictions appear in the abstract or described framework. The adversarial VGM-vs-VLM setup is a measurement protocol, not a self-referential definition or fit. Results are reported as direct empirical outcomes rather than quantities defined in terms of the inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. This is a standard benchmark paper whose central claims rest on dataset construction and measurement, not on any chain that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption ASMR videos emphasize fine-grained audio-visual perception and sensory immersion in a way that exposes artifacts missed by broad semantic benchmarks.
Reference graph
Works this paper leans on
-
[1]
Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2017. 3
work page 2017
-
[2]
Sound- net: Learning sound representations from unlabeled video
Yusuf Aytar, Carl V ondrick, and Antonio Torralba. Sound- net: Learning sound representations from unlabeled video. Advances in neural information processing systems, 29,
-
[3]
Qwen2.5-vl technical report, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. 7
work page 2025
-
[4]
Impossible videos.arXiv preprint arXiv:2503.14378, 2025
Zechen Bai, Hai Ci, and Mike Zheng Shou. Impossible videos.arXiv preprint arXiv:2503.14378, 2025. 1, 2, 5
-
[5]
Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai- Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024. 1, 2, 3
-
[6]
Demamba: Ai-generated video detection on million-scale genvideo benchmark
Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detection on million-scale genvideo benchmark.arXiv preprint arXiv:2405.19707, 2024. 1, 2
-
[7]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
deepmind. Introducing veo 3, our video generation model with expanded creative controls – including native audio and extended videos.https://deepmind.google/models/veo/,
-
[9]
The DeepFake Detection Challenge (DFDC) Dataset
Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) dataset.arXiv preprint arXiv:2006.07397, 2020. 2
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[10]
Seedance 1.0: Exploring the boundaries of video generation models, 2025
Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xi- aojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang,...
work page 2025
-
[11]
Watermarking ai-generated text and video with synthid
Google DeepMind. Watermarking ai-generated text and video with synthid. Blog, 2024. 3
work page 2024
-
[12]
Akio Hayakawa, Masato Ishii, Takashi Shibuya, and Yuki Mitsufuji. Mmdisco: Multi-modal discriminator-guided co- operative diffusion for joint audio and video generation. arXiv preprint arXiv:2405.17842, 2024. 3
-
[13]
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omni- modal understanding for multimodal llms.arXiv preprint arXiv:2502.04326, 2025. 3
work page internal anchor Pith review arXiv 2025
-
[14]
Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, et al. Step-video-ti2v technical re- port: A state-of-the-art text-driven image-to-video genera- tion model.arXiv preprint arXiv:2503.11251, 2025. 1
-
[15]
Vbench: Comprehensive bench- mark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 1, 2, 3
work page 2024
-
[16]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Achhardeep Kaur, Azadeh Noori Hoshyar, Vidya Saikrishna, Selena Firmin, and Feng Xia. Deepfake video detection: challenges and opportunities.Artificial Intelligence Review, 57(6):159, 2024. 2
work page 2024
-
[18]
Fakeavceleb: A novel audio-video multimodal deepfake dataset
H Khalid, S Tariq, M Kim, and SS Woo. Fakeavceleb: A novel audio-video multimodal deepfake dataset. arxiv 2021. arXiv preprint arXiv:2108.05080. 2
-
[19]
Kling ai: Next-generation ai creative studio, 2025
KLING. Kling ai: Next-generation ai creative studio, 2025. 1
work page 2025
-
[20]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1, 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shi- hao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Din- gling Zhang, et al. Omnivideobench: Towards audio-visual understanding evaluation for omni mllms.arXiv preprint arXiv:2510.10689, 2025. 3
-
[23]
Mvbench: A comprehensive multi-modal video understand- ing benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2
work page 2024
-
[24]
Celeb-df: A large-scale challenging dataset for deep- fake forensics
Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deep- fake forensics. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207– 3216, 2020. 2
work page 2020
-
[25]
Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, et al. Omnibench: Towards the future of universal omni-language models.arXiv preprint arXiv:2409.15272,
-
[26]
Detecting multimedia generated by large ai models: A survey.arXiv preprint arXiv:2402.00045, 2024
Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, and Shu Hu. Detecting multimedia generated by large ai models: A survey.arXiv preprint arXiv:2402.00045, 2024. 2
-
[27]
Haohe Liu, Gael Le Lan, Xinhao Mei, Zhaoheng Ni, Anurag Kumar, Varun Nagaraja, Wenwu Wang, Mark D Plumbley, Yangyang Shi, and Vikas Chandra. Syncflow: Toward tem- porally aligned joint audio-video generation from text.arXiv preprint arXiv:2412.15220, 2024. 3
-
[28]
TempCompass: Do Video LLMs Really Understand Videos?
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024. 2
work page internal anchor Pith review arXiv 2024
-
[29]
Gener- alizing face forgery detection with high-frequency features
Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Gener- alizing face forgery detection with high-frequency features. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 16317–16326, 2021. 2
work page 2021
-
[30]
Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xi- aoniu Song, Xing Chen, et al. Step-video-t2v technical re- port: The practice, challenges, and future of video founda- tion model.arXiv preprint arXiv:2502.10248, 2025. 1, 5, 6
-
[31]
Long Ma, Zhiyuan Yan, Qinglang Guo, Yong Liao, Haiyang Yu, and Pengyuan Zhou. Detecting ai-generated video via frame consistency.arXiv preprint arXiv:2402.02085, 2024. 3
-
[32]
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quan- feng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024. 1, 2
-
[33]
Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yi- ran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical common- sense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802, 2024. 1
-
[34]
Spoken moments: Learning joint audio-visual representations from video descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Har- wath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14871–14881, 2021. 3
work page 2021
-
[35]
Core: Consistent repre- sentation learning for face forgery detection
Yunsheng Ni, Depu Meng, Changqian Yu, Chengbin Quan, Dongchun Ren, and Youjian Zhao. Core: Consistent repre- sentation learning for face forgery detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12–21, 2022. 2
work page 2022
-
[36]
Genvidbench: A challenging benchmark for detecting ai-generated video
Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai-generated video.arXiv preprint arXiv:2501.11340, 2025. 1
-
[37]
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A com- prehensive benchmark and toolkit for evaluating video-based large language models.arXiv preprint arXiv:2311.16103,
- [38]
-
[39]
Sora 2 is here.https://openai.com/index/sora-2/,
OpenAI. Sora 2 is here.https://openai.com/index/sora-2/,
-
[40]
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, et al. Open-sora 2.0: Training a commercial-level video generation model in $200 k.arXiv preprint arXiv:2503.09642, 2025. 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Faceforen- sics++: Learning to detect manipulated facial images
Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Niessner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 2
work page 2019
-
[42]
Mm-diffusion: Learning multi-modal diffusion mod- els for joint audio and video generation
Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. Mm-diffusion: Learning multi-modal diffusion mod- els for joint audio and video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10219–10228, 2023. 3
work page 2023
-
[43]
GLM-V Team, :, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali ...
work page 2025
-
[44]
Audio-visual event localization in unconstrained videos
Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chen- liang Xu. Audio-visual event localization in unconstrained videos. InProceedings of the European conference on com- puter vision (ECCV), pages 247–263, 2018. 3
work page 2018
-
[45]
Beyond deepfake images: Detecting ai-generated videos
Danial Samadi Vahdati, Tai D Nguyen, Aref Azizpour, and Matthew C Stamm. Beyond deepfake images: Detecting ai-generated videos. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 4397–4408, 2024. 3
work page 2024
-
[46]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video gen- erative models.arXiv preprint arXiv:2503.20314, 2025. 1, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, and Yapeng Tian. Av-dit: Efficient audio-visual diffusion trans- former for joint audio and video generation.arXiv preprint arXiv:2406.07686, 2024. 3
-
[48]
Cnn-generated images are surprisingly easy to spot
Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 2
work page 2020
-
[49]
Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, and Emad Barsoum. Keyvid: Keyframe-aware video diffu- sion for audio-synchronized visual animation.arXiv preprint arXiv:2504.09656, 2025. 3
-
[50]
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
arXiv preprint arXiv:2410.09732
Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, et al. Loki: A comprehensive synthetic data detection benchmark using large multimodal models.arXiv preprint arXiv:2410.09732, 2024. 1, 2
-
[53]
Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl V on- drick, Josh McDermott, and Antonio Torralba. The sound of pixels. InProceedings of the European conference on com- puter vision (ECCV), pages 570–586, 2018. 3
work page 2018
-
[54]
D3: Training-free ai-generated video detection using second-order features
Chende Zheng, Ruiqi Suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, and Chao Shen. D3: Training-free ai-generated video detection using second-order features. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 12852– 12862, 2025. 3
work page 2025
-
[55]
Open-Sora: Democratizing Efficient Video Production for All
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
Wilddeepfake: A challenging real-world dataset for deepfake detection
Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. Wilddeepfake: A challenging real-world dataset for deepfake detection. InProceedings of the 28th ACM international conference on multimedia, pages 2382– 2390, 2020. 2 Appendix Contents
work page 2020
-
[57]
Different prompt for video reality test 12
-
[58]
Prompt to get the text description of the ASMR video 13
-
[59]
Visualization Examples. 13
-
[60]
Detailed experiments 13 9.1. Evaliuation metric . . . . . . . . . . . . . . 13 9.2. Hard-level results for Video Reality Test . . . 13
-
[61]
Different prompt for video reality test We use the following prompt as the default prompt for the experiments in the main paper: Prompt for detecting real or fake. Given the following video, please determine if the video is real or fake. First, think about the reasoning process, and then provide the an- swer. The reasoning process should be en- closed wit...
-
[62]
Prompt to get the text description of the ASMR video Prompt for getting storyboard of AMSR video Given 8 frames evenly sampled from an ASMR video, describe the overall scene in a single, con- tinuous paragraph. Integrate information about the visual environment (background, lighting, tex- tures, mood), the main subjects or objects, their ac- tions and tem...
-
[63]
Visualization Examples. Below we show the video frames and the reasoning process along with the answers for qualitative analysis, including: • Figure 7: Veo3.1-fast generated videos with and without audio evaluation, where with audio is detected to be gen- erated, while without audio is real. • Figure 8: Sora2 generated videos with and without audio evalu...
-
[64]
Detailed experiments 9.1. Evaliuation metric For the video understanding modelf, it predict the answer givennreal videos and it inducedngenerated fake videos. We extract the number within the<answer>number</an- swer>by regular pattern match. accuracy= correct detected answers number total valid answers number For the video generation modelg, it predictnge...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.