
arXiv: 2503.06624 · v2 · submitted 2025-03-09 · 💻 cs.CV

Chameleon: Benchmarking Detection and Backtracking on Commercial-Grade AI-Generated Videos

Pith reviewed 2026-05-23 00:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated video detection · deepfake detection · commercial AI models · video forensics · benchmark · backtracking · spatiotemporal consistency · scene forensics

The pith

Existing methods have critical limitations in detecting and backtracking videos from commercial closed-source AI models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Chameleon, a new benchmark dataset containing 1,700 high-fidelity AI-generated videos built from 600 real-world sources with commercial closed-source models, spanning the News, Speech, and Recommendation domains. It shifts the focus of detection research from open-source model outputs and face-centric forgeries to holistic scene forensics, supported by 3D consistency metrics and rich annotations. The benchmark evaluates models on detection accuracy under real-world conditions and on forensic backtracking to original sources. The reported experiments show that current methods struggle with the spatiotemporal consistency of these commercial videos and exhibit flawed forensic reasoning. The authors position this as a new challenge for developing robust AIGC security tools.

Core claim

Chameleon is a commercial-grade dataset that reveals existing AI-generated video detection methods exhibit critical limitations when applied to high-fidelity, spatiotemporally consistent videos produced by commercial closed-source models, moving the field toward scene-level forensic analysis.

What carries the argument

The Chameleon benchmark with its 1,700 videos, 3D consistency metrics for dynamic scene spatial coherence, and annotations supporting detection and backtracking tasks.

If this is right

  • AI video detection must incorporate holistic scene analysis rather than relying primarily on face forgeries.
  • Forensic backtracking of original sources becomes an essential capability for video authentication.
  • Methods need to handle the higher realism and temporal coherence typical of closed-source commercial generators.
  • New benchmarks like this will drive development of more advanced detection techniques for AIGC security.
  • Current detection approaches require reevaluation of their reasoning processes on high-quality content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developing detectors that explicitly model 3D scene consistency could address some of the identified limitations.
  • The benchmark could be extended to include videos from emerging commercial models to keep pace with technology.
  • Integration of backtracking with detection might enable more comprehensive forensic pipelines in practice.
  • Similar benchmarks in other media types like audio or images could follow this approach for commercial-grade content.

Load-bearing premise

The collected videos from commercial sources and their annotations accurately represent real-world conditions and provide reliable ground truth for forensic evaluation.

What would settle it

Demonstrating that a state-of-the-art detector achieves high accuracy on the Chameleon dataset or a similar set of commercial videos would challenge the claim of critical limitations.
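
As a rough illustration of what such a check involves, the sketch below scores a detector's outputs with accuracy and ROC AUC. It is not the paper's evaluation code: the function names, the 0.5 threshold, and the convention that label 1 means AI-generated are all assumptions made for this example.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of videos classified correctly at a given score threshold.

    labels: 1 for AI-generated, 0 for real (assumed convention).
    scores: detector confidence that the video is AI-generated.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random AI-generated video
    scores higher than a random real one (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one video of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data for illustration only, not results from the benchmark.
    toy_labels = [1, 1, 0, 0]
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    print(accuracy(toy_labels, toy_scores), roc_auc(toy_labels, toy_scores))
```

AUC is threshold-free, so reporting it alongside accuracy avoids tying the conclusion to one operating point.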

Figures

Figures reproduced from arXiv: 2503.06624 by Aimin Yang, Canyu Chen, Meiyu Zeng, Nankai Lin, Xingming Liao, Zhuowei Wang.

Figure 1: Examples of Chameleon. Frame sequences are used to show the dynamics of the video. Subfigure (a) represents real-world ...
Figure 2: Framework building for Chameleon. Blue for AI ...
Figure 3: ROC curve compares the performance of different mod...
Figure 4: Accuracy of different methods across categories and generation techniques with confidence thresholds equal to 1.0.
Figure 5: Examples of LVMs detection of AI-generated and real-world video frames in the Chameleon dataset. The left section presents ...
Figure 6: Examples of real-world video and AI-generated video in the News category in Chameleon. The video is presented in ten orderly ...
Figure 7: Examples of real-world video and AI-generated video in the Speech category in Chameleon. The video is presented in ten ...
Figure 8: Examples of real-world video and AI-generated video in the Recommendation category in Chameleon. The video is presented in ...
Figure 9: Comparison of partial frames extracted from videos generated by three techniques (Kling, Runway Gen 3, Jimeng).
Figure 10: ROC curves of different categories and AI generation techniques. The first row presents ROC curves for the three categories ...
Figure 11: The prediction accuracies of various models for different categories of videos and using different AI generation techniques ...
Original abstract

The proliferation of AI-Generated Content (AIGC), especially deepfake videos, poses a severe threat to social trust by enabling fraud, privacy violations and disinformation. Existing AI-generated video detection (AGVD) benchmarks focus on open-source model generated videos, yet commercial closed-source models produce more realistic, temporally coherent videos that are underexplored in detection research. To fill this gap, we present Chameleon, a commercial-grade dataset with 1,700 AI-generated videos from 600 real-world sources across three key domains (News, Speech, Recommendation), featuring high resolution, rich annotations and 3D consistency metrics for dynamic scene spatial coherence, shifting detection from face-centric forgery to holistic scene forensics. This benchmark assesses models on two core tasks: accurate AI video detection in real-world conditions and forensic backtracking of original sources. Experimental results reveal critical limitations of existing methods in detecting and backtracking high-fidelity, spatiotemporally consistent videos from commercial closed-source models, highlighting current methods' flawed forensic reasoning and establishing Chameleon as a vital challenge for AIGC security research. The code and data are available at https://github.com/lxixim/Chameleon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Chameleon, a dataset of 1,700 high-resolution AI-generated videos built from 600 real-world sources with commercial closed-source models across the News, Speech, and Recommendation domains. It supplies rich annotations and 3D consistency metrics intended to capture dynamic scene spatial coherence, and evaluates existing detection methods on two tasks: AI video detection under real-world conditions and forensic backtracking to original sources. The central claim is that current methods exhibit critical limitations on high-fidelity, spatiotemporally consistent commercial outputs, establishing the dataset as a new challenge for AIGC forensics.

Significance. If the annotations and 3D metrics are shown to be reliable, the dataset would provide a valuable, domain-diverse benchmark that moves AGVD research beyond open-source generators toward realistic commercial content. The public release of code and data at https://github.com/lxixim/Chameleon is a clear strength for reproducibility.

major comments (2)
  1. [Methods] The manuscript provides no description of the annotation process for source backtracking labels, inter-annotator agreement, or any validation of the 3D consistency metrics against ground-truth geometry or human forensic judgments. This is load-bearing for the claims in the Experiments section that performance drops reflect inherent method flaws rather than benchmark artifacts.
  2. [Experiments] The evaluation in the Experiments section reports performance limitations without statistical controls, confidence intervals, or ablation on annotation reliability, undermining the assertion that existing detectors have 'flawed forensic reasoning' on commercial videos.
minor comments (1)
  1. [Abstract] The abstract states that the dataset features 'rich annotations' but does not enumerate their contents (e.g., bounding boxes, source IDs, consistency scores); a brief list would improve clarity.
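
The inter-annotator agreement requested in major comment 1 could be reported with a chance-corrected statistic. The sketch below computes Fleiss' kappa, which handles any fixed number of annotators per item. It is a minimal illustration, not the authors' procedure: the function name and the input layout (one list of labels per video) are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items each rated by the same number of annotators.

    ratings: list of lists; ratings[i] holds the category labels that the
    annotators assigned to item i (for example, a backtracked source ID).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of ordered annotator pairs that agree on this item.
        agree_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agree_pairs / (n_raters * (n_raters - 1))
    observed /= n_items

    # Agreement expected by chance from the overall category frequencies.
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used; agreement is trivial
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates agreement well beyond chance, while a value near 0 means the labels agree no more often than random assignment with the same category frequencies would.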

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the methodological transparency and statistical rigor of the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Methods] The manuscript provides no description of the annotation process for source backtracking labels, inter-annotator agreement, or any validation of the 3D consistency metrics against ground-truth geometry or human forensic judgments. This is load-bearing for the claims in the Experiments section that performance drops reflect inherent method flaws rather than benchmark artifacts.

    Authors: We agree that explicit documentation of the annotation pipeline is required. In the revised manuscript we will insert a new subsection detailing: (i) the protocol for obtaining source backtracking labels (direct metadata from the commercial providers where disclosed, supplemented by reverse-image search for the remainder), (ii) the annotation guidelines and training given to the three independent annotators, and (iii) the resulting inter-annotator agreement (Fleiss' kappa, which generalizes Cohen’s kappa to more than two annotators). For the 3D consistency metrics we will add a validation experiment that correlates the metrics against human forensic judgments on a 200-video stratified subset. Ground-truth 3D geometry cannot be obtained because the videos originate from closed-source commercial generators that do not expose scene parameters; we therefore treat human expert agreement as the appropriate proxy validation. revision: yes

  2. Referee: [Experiments] The evaluation in the Experiments section reports performance limitations without statistical controls, confidence intervals, or ablation on annotation reliability, undermining the assertion that existing detectors have 'flawed forensic reasoning' on commercial videos.

    Authors: We accept that the current experimental presentation would benefit from additional statistical safeguards. The revised Experiments section will report 95% bootstrap confidence intervals for every detection and backtracking metric, include paired significance tests (McNemar and Wilcoxon) between methods, and add an ablation that varies the annotation-reliability threshold and re-computes performance drops. These changes will provide quantitative support for the claim that observed limitations stem from method shortcomings rather than benchmark noise. revision: yes
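
The statistical safeguards named in the second response can be sketched with the standard library alone. The example below shows a percentile bootstrap interval for accuracy and an exact two-sided McNemar test on paired per-video outcomes; the Wilcoxon test is not covered. Function names, the resample count, and the fixed seed are assumptions chosen for illustration, not details from the paper or the rebuttal.

```python
import math
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 flags, one per video (1 = prediction was right).
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # resample videos with replacement
        stats.append(sum(sample) / n)
    stats.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx  # symmetric upper percentile
    return stats[lo_idx], stats[hi_idx]


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-video correctness flags."""
    # Discordant pairs: videos where exactly one of the two methods is right.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the two methods never disagree
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

McNemar's test uses only the discordant pairs, which suits comparing two detectors on the same videos because items both get right or both get wrong carry no information about which method is better.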

Circularity Check

0 steps flagged

No circularity; benchmark dataset release with independent evaluation

Full rationale

The paper's contribution is the collection and release of a new dataset (1,700 videos, annotations, 3D metrics) from commercial sources, followed by empirical evaluation of existing detectors on detection and backtracking tasks. It contains no derivation chain, equations, fitted parameters, or predictions that could reduce to self-definition, fitted inputs, or load-bearing self-citation. The central claims rest on the new data itself rather than on any internal reduction, so the work stands on its own against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the commercial video collection and the validity of the 3D consistency metrics as forensic signals.

axioms (1)
  • domain assumption: Commercial closed-source models produce more realistic and temporally coherent videos than open-source models.
    Stated as the core motivation for creating the benchmark.


