BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

Baoyuan Wu; Guangliang Cheng; Haiquan Wen; Lu Qi; Tianxiao Li; Xiangtai Li; Xingru Huang; Yiwei He; Zhenglin Huang; Zihan Yu

arxiv: 2505.12620 · v8 · pith:VZON726Rnew · submitted 2025-05-19 · 💻 cs.CV

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

classification 💻 cs.CV

keywords busterx · detection · video · models · accuracy · ai-generated · baseline

As generative video models become increasingly realistic, detecting AI-generated videos requires systems that offer both accuracy and interpretability. However, applying Multimodal Large Language Models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present GenBuster-200K, a fair dataset of over 200,000 high-quality videos sourced from state-of-the-art generators, featuring diverse real-world scenarios. Second, we propose GenBuster-Bench, a diagnostic benchmark spanning three progressive tracks (In-Domain, Out-of-Domain, and In-the-Wild) to evaluate models across domain shifts and generational shifts. It also introduces an MLLM-as-a-Judge protocol to assess the quality of the generated forensic explanations. Finally, we develop BusterX, an MLLM baseline with RL training. Instead of performing direct binary classification, BusterX formulates detection as a visual reasoning task, where the generated reasoning chain itself serves as the detector. Experimental results demonstrate that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and rationale quality.
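
As a rough illustration of how results might be reported separately for the three tracks named in the abstract, the following minimal Python sketch computes per-track accuracy from a list of predictions. The track identifiers, the `real`/`fake` label strings, the record layout, and the function name `per_track_accuracy` are all assumptions made for this example; they are not taken from the paper or from any GenBuster-Bench tooling.

```python
from collections import defaultdict

# Hypothetical track identifiers mirroring the three tracks in the abstract.
TRACKS = ("in_domain", "out_of_domain", "in_the_wild")


def per_track_accuracy(records):
    """Return accuracy per track.

    `records` is an iterable of (track, label, prediction) tuples, where
    label and prediction are the strings "real" or "fake" (an assumed format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for track, label, prediction in records:
        if track not in TRACKS:
            raise ValueError(f"unknown track: {track}")
        total[track] += 1
        correct[track] += int(label == prediction)
    # Only report tracks that actually received samples.
    return {t: correct[t] / total[t] for t in TRACKS if total[t]}


# Toy records for illustration only; these are not results from the paper.
records = [
    ("in_domain", "fake", "fake"),
    ("in_domain", "real", "real"),
    ("out_of_domain", "fake", "real"),
    ("out_of_domain", "real", "real"),
    ("in_the_wild", "fake", "real"),
]
scores = per_track_accuracy(records)
```

Reporting the tracks separately, rather than pooling them into one number, keeps any drop under domain or generational shift visible.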

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Explainable Forensics of Manipulated Segments in Untrimmed Long Videos
cs.CV 2026-06 unverdicted novelty 7.0

Introduces TASLE benchmark and MSLoc baseline for temporal localization and explanation of manipulated segments in long videos.

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
cs.CV 2026-05 unverdicted novelty 6.0

Introduces a commercial-model contrastive AIGC video dataset and a hybrid contrastive-MLLM detection framework claiming SOTA performance on realistic video forgery detection.

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
cs.CV 2026-05 unverdicted novelty 6.0

Artifact-Bench supplies a three-level artifact taxonomy and three evaluation tasks that show 19 MLLMs perform near or below random on AI-video realism detection and reasoning.

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
cs.CV 2025-12 unverdicted novelty 6.0

Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.

ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection
cs.CV 2026-06 unverdicted novelty 5.0

ReConFuse detects AI-generated videos by fusing WF-VAE reconstruction error patterns with multi-frame semantic features via a Mamba-based temporal model.

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
cs.CV 2026-05 unverdicted novelty 5.0

Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

Watch, Remember, Reason: Human-View Video Understanding with MLLMs
cs.CV 2026-06 unverdicted novelty 4.0

This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.