Chameleon: Benchmarking Detection and Backtracking on Commercial-Grade AI-Generated Videos
Pith reviewed 2026-05-23 00:34 UTC · model grok-4.3
The pith
Existing methods have critical limitations in detecting and backtracking videos from commercial closed-source AI models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chameleon is a commercial-grade dataset that reveals existing AI-generated video detection methods exhibit critical limitations when applied to high-fidelity, spatiotemporally consistent videos produced by commercial closed-source models, moving the field toward scene-level forensic analysis.
What carries the argument
The Chameleon benchmark with its 1,700 videos, 3D consistency metrics for dynamic scene spatial coherence, and annotations supporting detection and backtracking tasks.
If this is right
- AI video detection must incorporate holistic scene analysis rather than relying primarily on face forgeries.
- Forensic backtracking of original sources becomes an essential capability for video authentication.
- Methods need to handle the higher realism and temporal coherence typical of closed-source commercial generators.
- New benchmarks like this will drive development of more advanced detection techniques for AIGC security.
- Current detection approaches require reevaluation of their reasoning processes on high-quality content.
Where Pith is reading between the lines
- Developing detectors that explicitly model 3D scene consistency could address some of the identified limitations.
- The benchmark could be extended to include videos from emerging commercial models to keep pace with technology.
- Integration of backtracking with detection might enable more comprehensive forensic pipelines in practice.
- Similar benchmarks in other media types like audio or images could follow this approach for commercial-grade content.
Load-bearing premise
The collected videos from commercial sources and their annotations accurately represent real-world conditions and provide reliable ground truth for forensic evaluation.
What would settle it
Demonstrating that a state-of-the-art detector achieves high accuracy on the Chameleon dataset or a similar set of commercial videos would challenge the claim of critical limitations.
Figures
read the original abstract
The proliferation of AI-Generated Content (AIGC), especially deepfake videos, poses a severe threat to social trust by enabling fraud, privacy violations and disinformation. Existing AI-generated video detection (AGVD) benchmarks focus on open-source model generated videos, yet commercial closed-source models produce more realistic, temporally coherent videos that are underexplored in detection research. To fill this gap, we present Chameleon, a commercial-grade dataset with 1,700 AI-generated videos from 600 real-world sources across three key domains (News, Speech, Recommendation), featuring high resolution, rich annotations and 3D consistency metrics for dynamic scene spatial coherence, shifting detection from face-centric forgery to holistic scene forensics. This benchmark assesses models on two core tasks: accurate AI video detection in real-world conditions and forensic backtracking of original sources. Experimental results reveal critical limitations of existing methods in detecting and backtracking high-fidelity, spatiotemporally consistent videos from commercial closed-source models, highlighting current methods' flawed forensic reasoning and establishing Chameleon as a vital challenge for AIGC security research. The code and data are available at https://github.com/lxixim/Chameleon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Chameleon, a dataset of 1,700 high-resolution AI-generated videos sourced from 600 real-world commercial closed-source models across News, Speech, and Recommendation domains. It supplies rich annotations and 3D consistency metrics intended to capture dynamic scene spatial coherence, and evaluates existing detection methods on two tasks: AI video detection under real-world conditions and forensic backtracking to original sources. The central claim is that current methods exhibit critical limitations on high-fidelity, spatiotemporally consistent commercial outputs, establishing the dataset as a new challenge for AIGC forensics.
Significance. If the annotations and 3D metrics are shown to be reliable, the dataset would provide a valuable, domain-diverse benchmark that moves AGVD research beyond open-source generators toward realistic commercial content. The public release of code and data at https://github.com/lxixim/Chameleon is a clear strength for reproducibility.
major comments (2)
- [Methods] The manuscript provides no description of the annotation process for source backtracking labels, inter-annotator agreement, or any validation of the 3D consistency metrics against ground-truth geometry or human forensic judgments. This is load-bearing for the claims in the Experiments section that performance drops reflect inherent method flaws rather than benchmark artifacts.
- [Experiments] The evaluation in the Experiments section reports performance limitations without statistical controls, confidence intervals, or ablation on annotation reliability, undermining the assertion that existing detectors have 'flawed forensic reasoning' on commercial videos.
minor comments (1)
- [Abstract] The abstract states that the dataset features 'rich annotations' but does not enumerate their contents (e.g., bounding boxes, source IDs, consistency scores); a brief list would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the methodological transparency and statistical rigor of the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [Methods] The manuscript provides no description of the annotation process for source backtracking labels, inter-annotator agreement, or any validation of the 3D consistency metrics against ground-truth geometry or human forensic judgments. This is load-bearing for the claims in the Experiments section that performance drops reflect inherent method flaws rather than benchmark artifacts.
Authors: We agree that explicit documentation of the annotation pipeline is required. In the revised manuscript we will insert a new subsection detailing: (i) the protocol for obtaining source backtracking labels (direct metadata from the 600 commercial providers where disclosed, supplemented by reverse-image search for the remainder), (ii) the annotation guidelines and training given to the three independent annotators, and (iii) the resulting inter-annotator agreement (Cohen’s kappa). For the 3D consistency metrics we will add a validation experiment that correlates the metrics against human forensic judgments on a 200-video stratified subset. Ground-truth 3D geometry cannot be obtained because the videos originate from closed-source commercial generators that do not expose scene parameters; we therefore treat human expert agreement as the appropriate proxy validation. revision: yes
-
Referee: [Experiments] The evaluation in the Experiments section reports performance limitations without statistical controls, confidence intervals, or ablation on annotation reliability, undermining the assertion that existing detectors have 'flawed forensic reasoning' on commercial videos.
Authors: We accept that the current experimental presentation would benefit from additional statistical safeguards. The revised Experiments section will report 95 % bootstrap confidence intervals for every detection and backtracking metric, include paired significance tests (McNemar and Wilcoxon) between methods, and add an ablation that varies the annotation-reliability threshold and re-computes performance drops. These changes will provide quantitative support for the claim that observed limitations stem from method shortcomings rather than benchmark noise. revision: yes
Circularity Check
No circularity; benchmark dataset release with independent evaluation
full rationale
The paper's contribution is the collection and release of a new dataset (1,700 videos, annotations, 3D metrics) from commercial sources, followed by empirical evaluation of existing detectors on detection and backtracking tasks. No derivation chain, equations, fitted parameters, or predictions are present that could reduce to self-definition, fitted inputs, or self-citation load-bearing. The central claims rest on the new data itself rather than any internal reduction, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Commercial closed-source models produce more realistic and temporally coherent videos than open-source models.
Reference graph
Works this paper leans on
-
[1]
Innovative ai solutions for pneumonia detection: Exploring densenet-161 in medi- cal imaging
Ruchika Bhuria and Sheifali Gupta. Innovative ai solutions for pneumonia detection: Exploring densenet-161 in medi- cal imaging. In 2024 5th International Conference on Data Intelligence and Cognitive Informatics (ICDICI), pages 638–
work page 2024
-
[2]
Chaka Chaka. Detecting ai content in responses generated by chatgpt, youchat, and chatsonic: The case of five ai content detection tools. Journal of Applied Learning and Teaching, 6(2):94–104, 2023
work page 2023
-
[3]
Fakecatcher: Detection of synthetic portrait videos using biological sig- nals
Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. Fakecatcher: Detection of synthetic portrait videos using biological sig- nals. IEEE transactions on pattern analysis and machine intelligence, 2020
work page 2020
-
[4]
Raising the bar of ai-generated image detection with CLIP
Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with CLIP. In IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024 , pages 4356–4366. IEEE, 2024
work page 2024
-
[5]
The deepfake detection chal- lenge (DFDC) preview dataset
Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton-Ferrer. The deepfake detection chal- lenge (DFDC) preview dataset. CoRR, abs/1910.08854, 2019
-
[6]
Enabling ai- generated content services in wireless edge networks
Hongyang Du, Zonghang Li, Dusit Niyato, Jiawen Kang, Ze- hui Xiong, Xuemin Shen, and Dong In Kim. Enabling ai- generated content services in wireless edge networks. IEEE Wirel. Commun., 31(3):226–234, 2024
work page 2024
-
[7]
Delving into the local: Dynamic in- consistency learning for deepfake video detection
Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, and Lizhuang Ma. Delving into the local: Dynamic in- consistency learning for deepfake video detection. In Thirty- Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Ar- tificial Intelligence, IAAI 2022, The Twelveth Symposium on Ed...
work page 2022
-
[8]
Trufor: Leveraging all-round clues for trustworthy image forgery detection and localiza- tion
Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localiza- tion. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 20606–20615, 2023
work page 2023
-
[9]
Biscope: Ai-generated text detection by checking memorization of preceding tokens
Hanxi Guo, Siyuan Cheng, Xiaolong Jin, Zhuo Zhang, Kaiyuan Zhang, Guanhong Tao, Guangyu Shen, and Xi- angyu Zhang. Biscope: Ai-generated text detection by checking memorization of preceding tokens. In Advances in Neural Information Processing Systems 38: Annual Con- ference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada...
work page 2024
-
[10]
Detective: Detect- ing ai-generated text via multi-level contrastive learning
Xun Guo, Yongxin He, Shan Zhang, Ting Zhang, Wanquan Feng, Haibin Huang, and Chongyang Ma. Detective: Detect- ing ai-generated text via multi-level contrastive learning. In Advances in Neural Information Processing Systems 38: An- nual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024
work page 2024
-
[11]
Anna Yoo Jeong Ha, Josephine Passananti, Ronik Bhaskar, Shawn Shan, Reid Southen, Hai-Tao Zheng, and Ben Y . Zhao. Organic or diffused: Can we distinguish human art from ai-generated images? In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communica- tions Security, CCS 2024, Salt Lake City, UT, USA, October 14-18, 2024, pages 4822–4836...
work page 2024
-
[12]
Chatgpt for shaping the future of dentistry: The potential of multi-modal large lan- guage model
Hanyao Huang, Ou Zheng, Dongdong Wang, Jiayi Yin, Zi- jin Wang, Shengxuan Ding, Heng Yin, Chuan Xu, Renjie Yang, Qian Zheng, and Bing Shi. Chatgpt for shaping the future of dentistry: The potential of multi-modal large lan- guage model. CoRR, abs/2304.03086, 2023
-
[13]
Gpt-4o: The cutting-edge advancement in multimodal llm
Raisa Islam and Owana Marzia Moushi. Gpt-4o: The cutting-edge advancement in multimodal llm. Authorea Preprints, 2024
work page 2024
-
[14]
Shan Jia, Reilin Lyu, Kangran Zhao, Yize Chen, Zhiyuan Yan, Yan Ju, Chuanbo Hu, Xin Li, Baoyuan Wu, and Siwei Lyu. Can chatgpt detect deepfakes? a study of using mul- timodal large language models for media forensics. CoRR, abs/2403.14077, 2024
-
[15]
Evading watermark based detection of ai-generated content
Zhengyuan Jiang, Jinghuai Zhang, and Neil Zhenqiang Gong. Evading watermark based detection of ai-generated content. In Proceedings of the 2023 ACM SIGSAC Con- ference on Computer and Communications Security , page 1168–1181, New York, NY , USA, 2023. Association for Computing Machinery
work page 2023
-
[16]
Revisiting generalizability in deepfake detection: Improving metrics and stabilizing transfer
Sarthak Kamat, Shruti Agarwal, Trevor Darrell, and Anna Rohrbach. Revisiting generalizability in deepfake detection: Improving metrics and stabilizing transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 426–435, 2023
work page 2023
-
[17]
Maple: Multi-modal prompt learning
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122, 2023
work page 2023
-
[18]
Learning to prompt with text only supervision for vision- language models
Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, and Federico Tombari. Learning to prompt with text only supervision for vision- language models. CoRR, abs/2401.02418, 2024
-
[19]
Cored: Gen- eralizing fake media detection with continual representation using distillation
Minha Kim, Shahroz Tariq, and Simon S Woo. Cored: Gen- eralizing fake media detection with continual representation using distillation. In Proceedings of the 29th ACM Interna- tional Conference on Multimedia, pages 337–346, 2021
work page 2021
-
[20]
Brett Koonce. Efficientnet. In Convolutional neural net- works with swift for Tensorflow: image recognition and dataset categorization, pages 109–123. Springer, 2021
work page 2021
- [21]
-
[22]
Ryo Kurokawa, Yuji Ohizumi, Jun Kanzawa, Mariko Kurokawa, Yuki Sonoda, Yuta Nakamura, Takao Kiguchi, Wataru Gonoi, and Osamu Abe. Diagnostic performances of claude 3 opus and claude 3.5 sonnet from patient history and key images in radiology’s “diagnosis please” cases.Japanese Journal of Radiology, pages 1–4, 2024
work page 2024
-
[23]
Faster than lies: Real-time deepfake detection using binary neural networks
Romeo Lanzino, Federico Fontana, Anxhelo Diko, Marco Raoul Marini, and Luigi Cinque. Faster than lies: Real-time deepfake detection using binary neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3771–3780, 2024
work page 2024
-
[24]
A continual deepfake detection benchmark: Dataset, methods, and essentials
Chuqiao Li, Zhiwu Huang, Danda Pani Paudel, Yabin Wang, Mohamad Shahbazi, Xiaopeng Hong, and Luc Van Gool. A continual deepfake detection benchmark: Dataset, methods, and essentials. In Proceedings of the IEEE/CVF Winter Con- ference on Applications of Computer Vision (WACV), pages 1339–1349, 2023
work page 2023
-
[25]
In ictu oculi: Exposing ai created fake videos by detecting eye blinking
Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In 2018 IEEE International workshop on information forensics and security (WIFS), pages 1–7. Ieee, 2018
work page 2018
-
[26]
Celeb-df: A large-scale challenging dataset for deep- fake forensics
Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deep- fake forensics. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 3204–3213. Computer Vision Foundation / IEEE, 2020
work page 2020
-
[27]
A survey of multimodel large language models
Zijing Liang, Yanjie Xu, Yifan Hong, Penghui Shang, Qi Wang, Qiang Fu, and Ke Liu. A survey of multimodel large language models. In Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering, page 405–409, New York, NY , USA, 2024. As- sociation for Computing Machinery
work page 2024
-
[28]
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jian- feng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models. CoRR, abs/2402.17177, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021
work page 2021
-
[30]
Classi- fication of human- and ai-generated texts: Investigating fea- tures for chatgpt
Lorenz Mindner, Tim Schlippe, and Kristina Schaaff. Classi- fication of human- and ai-generated texts: Investigating fea- tures for chatgpt. CoRR, abs/2308.05341, 2023
-
[31]
Anisha Pal, Julia Kruk, Mansi Phute, Manognya Bhattaram, Diyi Yang, Duen Horng Chau, and Judy Hoffman. Semi- truths: A large-scale dataset of ai-augmented images for evaluating robustness of ai-generated image detectors. In Advances in Neural Information Processing Systems 38: An- nual Conference on Neural Information Processing Systems 2024, NeurIPS 2024...
work page 2024
-
[32]
Faceforen- sics++: Learning to detect manipulated facial images
Andreas R ¨ossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In 2019 IEEE/CVF International Conference on Computer Vi- sion, ICCV 2019, Seoul, Korea (South), October 27 - Novem- ber 2, 2019, pages 1–11. IEEE, 2019
work page 2019
-
[33]
Huimin She, Yongjian Hu, Beibei Liu, Jicheng Li, and Chang-Tsun Li. Using graph neural networks to improve generalization capability of the models for deepfake detec- tion. IEEE Transactions on Information Forensics and Secu- rity, 2024
work page 2024
-
[34]
Chuangchuang Tan, Huan Liu, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024 , pages 28130–28139. IEEE, 2024
work page 2024
-
[35]
Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake de- tection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024
work page 2024
-
[36]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Detecting affect states using vgg16, resnet50 and se-resnet50 networks
Dhananjay Theckedath and RR Sedamkar. Detecting affect states using vgg16, resnet50 and se-resnet50 networks. SN Computer Science, 1(2):79, 2020
work page 2020
-
[38]
Hugo Touvron, Matthieu Cord, and Herv ´e J ´egou. Deit iii: Revenge of the vit. In European conference on computer vision, pages 516–533. Springer, 2022
work page 2022
-
[39]
Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
work page 2020
-
[40]
Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. Gan inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 45(3):3121–3138, 2023
work page 2023
-
[41]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using VQ-V AE and transformers. CoRR, abs/2104.10157, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[42]
Toward the third gener- ation artificial intelligence
Bo Zhang, Jun Zhu, and Hang Su. Toward the third gener- ation artificial intelligence. Science China Information Sci- ences, 66(2):121101, 2023
work page 2023
-
[43]
Pointclip: Point cloud understanding by clip
Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xu- peng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8552–8562, 2022
work page 2022
-
[44]
Zhehao Zhang, Weicheng Ma, and Soroush V osoughi. Is gpt-4v (ision) all you need for automating academic data vi- sualization? exploring vision-language models’ capability in reproducing academic charts. In Findings of the Association for Computational Linguistics: EMNLP 2024 , pages 8271– 8288, 2024. 10
work page 2024
-
[45]
Breaking semantic ar- tifacts for generalized ai-generated image detection
Chende Zheng, Chenhao Lin, Zhengyu Zhao, Hang Wang, Xu Guo, Shuai Liu, and Chao Shen. Breaking semantic ar- tifacts for generalized ai-generated image detection. In Ad- vances in Neural Information Processing Systems 38: An- nual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024
work page 2024
-
[46]
News, Speech, and Recommendation
Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. Wilddeepfake: A challenging real-world dataset for deepfake detection. In Proceedings of the 28th ACM international conference on multimedia , pages 2382– 2390, 2020. 11 Supplementary Materials A. Construction Algorithm of the Chameleon As stated in the main paper, the construction of ...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.