BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Guangliang Cheng; Haiquan Wen; Tianxiao Li; Yiwei He; Zhenglin Huang

arxiv: 2507.14632 · v4 · pith:6ALDP3YFnew · submitted 2025-07-19 · 💻 cs.CV

BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Haiquan Wen , Tianxiao Li , Zhenglin Huang , Yiwei He , Guangliang Cheng This is my paper

classification 💻 cs.CV

keywords imagevideocross-modalunifiedbusterxdetectionpolicyreasoning

0 comments

read the original abstract

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present \textbf{BusterX++}, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce \textbf{GenBuster-Bench++}, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted $SFT \rightarrow RL$ post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight reveals that SFT imposes lower policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
cs.CV 2026-05 unverdicted novelty 6.0

Artifact-Bench supplies a three-level artifact taxonomy and three evaluation tasks that show 19 MLLMs perform near or below random on AI-video realism detection and reasoning.
DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
cs.CV 2026-04 unverdicted novelty 6.0

DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
cs.CV 2025-12 unverdicted novelty 6.0

Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.
Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
cs.CV 2026-05 unverdicted novelty 5.0

VINA trains a single detector on images plus video frames using a cross-modal supervised contrastive objective, yielding bidirectional gains and SOTA results on 14 image, video, and in-the-wild benchmarks.