MERIT: Modular Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

2025 · cs.AI · arXiv 2510.17590

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

We present MERIT, an inference-time modular framework for multimodal misinformation detection that decomposes verification into four specialized modules: visual forensics, cross-modal alignment, retrieval-augmented claim verification, and calibrated judgment. On MMFakeBench, MERIT with GPT-4o-mini achieves 81.65% F1, outperforming all reported zero-shot baselines including GPT-4V with MMD-Agent (74.0% F1). A controlled same-model evaluation confirms gains stem from architectural design: MERIT achieves 6.14 points higher misinformation recall than MMD-Agent under identical model conditions, with per-class gains of +18.0 on visual distortion and +5.33 on textual distortion. Ablation studies reveal non-overlapping module specialization, where removing any module disproportionately degrades its target category while leaving others intact. Test set evaluation on 5,000 samples confirms generalization within 0.21 F1 points of validation results. The framework operates with any instruction-following vision-language model and produces citation-linked rationales for human review.
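The four-module decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Sample`, `Finding`, `run_pipeline`, and the toy keyword rules inside each module are hypothetical stand-ins, and the example URL is a placeholder. In a real system each module would presumably prompt a vision-language model (and, for claim verification, a web retriever); here they are stubs so that the control flow and the citation-linked output are visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Sample:
    text: str
    image_path: str


@dataclass
class Finding:
    module: str      # which module produced this finding
    flagged: bool    # True if the module suspects misinformation
    rationale: str   # short human-readable explanation
    citations: List[str] = field(default_factory=list)  # supporting sources


Module = Callable[[Sample], Finding]


# Hypothetical stubs: toy rules stand in for model and retrieval calls.
def visual_forensics(s: Sample) -> Finding:
    flagged = "edited" in s.image_path
    return Finding("visual_forensics", flagged, "toy check on file name")


def cross_modal_alignment(s: Sample) -> Finding:
    return Finding("cross_modal_alignment", False, "toy check: always aligned")


def claim_verification(s: Sample) -> Finding:
    flagged = "miracle cure" in s.text.lower()
    cites = ["https://example.org/fact-check"] if flagged else []
    return Finding("claim_verification", flagged, "toy keyword lookup", cites)


def judge(findings: List[Finding]) -> Tuple[str, List[Finding]]:
    """Aggregate module findings into a label plus the evidence behind it."""
    flagged = [f for f in findings if f.flagged]
    label = "misinformation" if flagged else "real"
    return label, flagged


DEFAULT_MODULES: List[Module] = [
    visual_forensics,
    cross_modal_alignment,
    claim_verification,
]


def run_pipeline(
    sample: Sample, modules: Optional[List[Module]] = None
) -> Tuple[str, List[Finding]]:
    """Run each evidence module, then pass all findings to the judge."""
    active = DEFAULT_MODULES if modules is None else modules
    return judge([m(sample) for m in active])
```

Passing a shorter `modules` list mimics an ablation: with `claim_verification` removed, the toy pipeline no longer flags a sample whose only problem is a false textual claim, which mirrors (in a simplified way) the abstract's observation that removing a module mainly degrades its own target category.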

representative citing papers

When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

EVID-Bench supplies 222 videos across nine manipulation types in three categories and shows that frontier multimodal models reach at most 61.43% point-level accuracy when forced to use web search to identify false information.

citing papers explorer

Showing 1 of 1 citing paper.

When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection

cs.CV · 2026-06-02 · unverdicted · none · ref 12 · internal anchor
