Multimodal Claim Extraction for Fact-Checking
Pith reviewed 2026-05-16 09:12 UTC · model grok-4.3
The pith
The first benchmark for multimodal claim extraction shows that intent-aware modeling improves extraction from text-image social media posts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that multimodal claim extraction from social media requires explicit modeling of rhetorical intent and contextual cues; the first benchmark dataset, built from real fact-checker annotations, exposes the shortcomings of baseline multimodal LLMs on semantic alignment, faithfulness, and decontextualization, while the new MICE framework delivers measurable gains on the same three metrics in intent-critical instances.
What carries the argument
MICE, an intent-aware framework that augments multimodal LLMs with explicit modeling of rhetorical intent and contextual cues to guide claim extraction.
Load-bearing premise
Claims written by professional fact-checkers form a reliable and representative target for what an automated system should extract from multimodal posts.
What would settle it
A controlled test on new posts in which independent fact-checkers extract claims and MICE scores lower than baselines on the three-part evaluation of semantic alignment, faithfulness, and decontextualization.
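One way such a controlled comparison could be scored is a paired bootstrap over per-post score differences. The sketch below is an assumption, not a procedure described in the paper: the function name `paired_bootstrap_ci`, the numeric score scale, and the example values are all hypothetical.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-post differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Hypothetical per-post faithfulness scores, invented for illustration only.
mice_scores = [0.75, 0.5, 1.0, 0.75, 0.5]
baseline_scores = [0.5, 0.5, 0.75, 1.0, 0.25]
mean_diff, lower, upper = paired_bootstrap_ci(mice_scores, baseline_scores)
```

Under this reading, an interval lying wholly below zero on all three metrics would match the outcome described above. An interval that straddles zero would leave the question open.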
Original abstract
Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today's misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework which shows improvements in intent-critical cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present the first benchmark for multimodal claim extraction from social media posts containing text and one or more images, with annotations consisting of gold-standard claims derived from real-world fact-checkers. It evaluates state-of-the-art multimodal LLMs under a three-part framework (semantic alignment, faithfulness, and decontextualization), reports that baselines struggle to model rhetorical intent and contextual cues, and introduces the MICE intent-aware framework, which shows improvements in intent-critical cases.
Significance. If the benchmark construction and reported improvements hold after addressing the representativeness concern, the work would be a meaningful contribution to automated fact-checking. It targets an underexplored multimodal setting distinct from text-only claim extraction or standard vision-language tasks, supplies a new annotated dataset, and proposes an evaluation framework that explicitly separates semantic, faithfulness, and decontextualization dimensions. These elements could serve as a foundation for downstream AFC systems that must handle memes, screenshots, and rhetorical social-media content.
major comments (1)
- [Benchmark Construction and Evaluation Framework] The headline result—that MICE improves extraction in intent-critical cases—depends on the benchmark targets being representative. Gold-standard claims drawn from fact-checker outputs preferentially sample verifiable, high-stakes propositional content; many multimodal posts are rhetorical, context-dependent, or only partially propositional. Without evidence that the annotation distribution matches the broader population of social-media claims (e.g., via a comparison to a random sample of multimodal posts), measured gains may be an artifact of the benchmark rather than evidence of improved multimodal reasoning. This issue is load-bearing for the central claim.
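The comparison the referee asks for could be sketched as a distance between category shares in the benchmark and in a random sample of posts. This is a minimal illustration under stated assumptions: the choice of total variation distance, the helper names, and the example labels are hypothetical and do not come from the paper.

```python
from collections import Counter

def category_shares(labels):
    """Return each category's share of the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def total_variation(labels_a, labels_b):
    """Total variation distance: 0 for identical shares, 1 for disjoint ones."""
    p, q = category_shares(labels_a), category_shares(labels_b)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical intent labels, invented for illustration only.
benchmark_posts = ["inform", "inform", "persuade", "satire"]
random_posts = ["inform", "entertain", "entertain", "satire"]
distance = total_variation(benchmark_posts, random_posts)
```

A distance near zero would suggest the fact-checker-derived sample resembles the wider population on that dimension. A large distance would support the referee's concern.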
minor comments (2)
- [Abstract] The abstract asserts that MICE 'shows improvements' yet supplies no quantitative metrics, confidence intervals, or statistical tests. Adding the key numbers (e.g., delta on each of the three metrics for intent-critical vs. non-critical subsets) would make the summary self-contained.
- [Evaluation Framework] The three-part evaluation framework is introduced without explicit formulas or annotation guidelines for semantic alignment, faithfulness, and decontextualization. Providing the precise scoring rubrics or inter-annotator agreement figures would improve reproducibility.
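For the inter-annotator agreement figure requested above, Cohen's kappa over a three-level label set is one common choice. The paper does not state which statistic it uses, so the sketch below is an assumption, and the two annotators' labels are invented for illustration.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical faithfulness labels from two annotators.
annotator_1 = ["entailed", "entailed", "partially_entailed", "not_entailed"]
annotator_2 = ["entailed", "partially_entailed", "partially_entailed", "not_entailed"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Because the labels are ordered, a weighted kappa or Krippendorff's alpha may fit better. Plain kappa is used here only to keep the sketch short.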
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on benchmark representativeness below and will incorporate revisions to strengthen the discussion of the benchmark's scope and intended use case in automated fact-checking.
Referee: The headline result—that MICE improves extraction in intent-critical cases—depends on the benchmark targets being representative. Gold-standard claims drawn from fact-checker outputs preferentially sample verifiable, high-stakes propositional content; many multimodal posts are rhetorical, context-dependent, or only partially propositional. Without evidence that the annotation distribution matches the broader population of social-media claims (e.g., via a comparison to a random sample of multimodal posts), measured gains may be an artifact of the benchmark rather than evidence of improved multimodal reasoning. This issue is load-bearing for the central claim.
Authors: We agree that representativeness merits explicit discussion. Our benchmark is deliberately constructed from posts with gold-standard claims sourced from professional fact-checkers, as these constitute the verifiable, high-stakes content that automated fact-checking pipelines are designed to process. This distribution aligns with the real-world targets of AFC rather than the full, noisier population of social-media posts. We acknowledge that the dataset may under-represent purely rhetorical or non-propositional multimodal content. To address the concern, we will add a dedicated subsection under Limitations that (1) details the fact-checker-driven construction process, (2) clarifies that performance gains are reported for intent-critical cases within this AFC-relevant distribution, and (3) notes the absence of a random-sample comparison as a scope limitation. This revision will make the intended applicability of MICE transparent without overstating generalizability.
revision: yes
Circularity Check
No circularity in derivation chain
The paper introduces a new benchmark dataset of multimodal social media posts annotated with gold-standard claims from external fact-checkers and proposes the MICE intent-aware framework. No equations, fitted parameters, or self-citations appear as load-bearing elements in the central claims. The three-part evaluation (semantic alignment, faithfulness, decontextualization) and reported improvements are defined independently of any prior fitted quantities or self-referential definitions within the work. The derivation chain consists of dataset construction and framework design rather than any reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Professional fact-checker annotations provide a reliable target distribution for what constitutes a claim in a multimodal post.
- ad hoc to paper: Rhetorical intent is a separable and modelable component that improves claim extraction when explicitly handled.
invented entities (1)
- MICE framework: no independent evidence
A MMCE Dataset
The breakdown of the social media sites of the data from MMCE is shown in table A1. Socia...
Claim Refinement Process:
- Remove subjective language
- Distill the core factual assertion
- Ensure the claim is neutral and objective
# CLAIM SELECTION STRATEGY
- Always try to extract just one main claim first
- If the text contains one main factual assertion, extract only that claim
- If multiple statements can be combined into one coherent claim, ...
Identify Potential Claims:
- Look for definitive statements
- Detect implied assertions
- Recognize potentially misleading or exaggerated claims
Claim Criteria:
- Clarity: Can the claim stand alone and be understood without the original context?
- Specificity: Does the claim capture the most significant factual assertion?
- Verifiability: Does the claim provide enough detail to enable fact-checking?
Claim Refinement Process:
- Remove subjective language
- Distill the core factual assertion
- Ensure the claim is neutral and objective
- Consider whether the image alters, reinforces, or creates the perceived claim
# CLAIM SELECTION STRATEGY
- Always try to extract just one main claim first
- If the text contains one main factual assertion, extract o...
- INTENT: What's the main purpose of the post? (inform / persuade / entertain / satire / etc.)
- TONE: What's the emotional tone of the post? (serious / humorous / sarcastic / anger / etc.)
- CONTEXT: What real-world events / issues does this relate to? Include specific details like dates, locations, people, organizations
- VISUAL_CONTEXT: What specific people, objects, locations, or events are shown in the image that provide context for the claim?
# RESPONSE FORMAT
Return your analysis as a JSON object with the following structure:
{ "intent": "description of the poster's main purpose", "tone": "description of the emotional tone", "context": "b...
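A reply in the response format above could be read into a typed record roughly as follows. This is a sketch, not the paper's implementation: `IntentAnalysis` and `parse_intent_analysis` are hypothetical names, and the `visual_context` key is an assumption because the JSON example in the source is truncated after the `context` field.

```python
import json
from dataclasses import dataclass

@dataclass
class IntentAnalysis:
    intent: str
    tone: str
    context: str
    visual_context: str = ""  # assumed key; the source's JSON example is cut off

def parse_intent_analysis(reply: str) -> IntentAnalysis:
    """Extract the outermost JSON object from a model reply and validate it."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    # Only the three keys visible in the source are treated as required.
    missing = [key for key in ("intent", "tone", "context") if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return IntentAnalysis(
        intent=str(data["intent"]),
        tone=str(data["tone"]),
        context=str(data["context"]),
        visual_context=str(data.get("visual_context", "")),
    )

# Hypothetical model reply, invented for illustration only.
example_reply = 'Analysis: {"intent": "persuade", "tone": "sarcastic", "context": "2024 election"}'
analysis = parse_intent_analysis(example_reply)
```

Slicing from the first to the last brace tolerates replies that wrap the JSON in prose or a code fence, at the cost of failing when the reply contains more than one object.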
- entailed: The claim is fully aligned with the post content without any contradictions, hallucinations or unsupported additions
- partially_entailed: The claim is partially aligned with the post content but contains minor variations or additional context not stated or implied in the post
- not_entailed: The claim contains significant misaligned inferences, exaggerations beyond what's stated, major contradictions, hallucinations, or completely misaligns with the post content.
Decontextualization [Look at column D only]
You will be given a candidate claim that was extracted from a social media post. Your task is t...
- fully_decontextualized: Understandable in isolation. The claim is completely self-contained, unambiguous, and requires no edits to be understood on its own. (Example: The mayor of NYC announced a new recycling program on June 1, 2024.)
- partially_decontextualized: The claim is mostly clear and contains some context, but has gaps, vague references or unresolved pronouns. The claim could benefit from some edits. (Example: Vaccination rates rose after that. -> could be rewritten as -> Vaccination rates in the UK rose after the 2023 campaign.)
- not_decontextualized: Not understandable in isolation. The claim cannot be interpreted on its own; key entities, referents, or context are missing. Major rewriting is needed. (Example: He did something yesterday.)
H Error Analysis
To surface the key challenges faced in image-text social media claim extraction, we select errone...