Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

Bryan Hooi; Fanxiao Li; Jiaying Wu; Min-Yen Kan; Zihang Fu

arxiv: 2505.15489 · v4 · submitted 2025-05-21 · cs.CV · cs.CL · cs.MM


Pith reviewed 2026-05-22 13:48 UTC · model grok-4.3


keywords: multimodal misinformation · vision-language models · creator intent · deception detection · benchmark dataset · intent reasoning · misinformation governance · image-caption pairs


The pith

Training vision-language models on simulated creator intent data improves detection of misleading narratives in multimodal news.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DeceptionDecoded, a benchmark of 12,000 image-caption pairs built from trustworthy articles through an intent-guided simulation that models both the influence a creator wants and the execution steps they would take. Current vision-language models struggle with these intent tasks and fall back on surface signals such as stylistic polish or simple alignment between image and text. Training on the new dataset shifts models toward implication-level reasoning and produces strong transfer gains on real-world multimodal misinformation detection. A sympathetic reader would care because multimodal misinformation often works through embedded narratives rather than outright false facts, so intent-aware tools matter for practical governance.

Core claim

We introduce DeceptionDecoded, a large-scale benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles and generated by an intent-guided simulation framework that explicitly models the desired influence and the execution plan of news creators. The dataset includes both misleading and non-misleading examples across visual and textual manipulations and defines three intent-centric tasks: misleading intent detection, misleading source attribution, and creator desire inference. Evaluation of 14 state-of-the-art vision-language models reveals consistent difficulty with intent reasoning, with reliance on shallow cues instead. Models trained on DeceptionDecoded show strong transfer to real-world multimodal misinformation detection.
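One way to picture how a single benchmark item could carry labels for all three tasks is the record sketched below. This is an illustration only: the class name, field names, and the "image"/"text" source values are assumptions made here, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntentExample:
    """Hypothetical record for one image-caption pair (field names are assumed)."""

    image_path: str
    caption: str
    reference_article: str            # trustworthy article the pair is grounded in
    is_misleading: bool               # task 1: misleading intent detection
    misleading_source: Optional[str]  # task 2: assumed values "image" or "text"
    creator_desire: Optional[str]     # task 3: free-text description of desired influence

    def __post_init__(self) -> None:
        # A misleading item must name the modality that carries the manipulation.
        if self.is_misleading and self.misleading_source not in ("image", "text"):
            raise ValueError("misleading items need a source of 'image' or 'text'")
        # A non-misleading item has no manipulation source and no misleading desire.
        if not self.is_misleading and (
            self.misleading_source is not None or self.creator_desire is not None
        ):
            raise ValueError("non-misleading items must leave source and desire empty")
```

Keeping the three labels on one record makes it easy to check that they stay mutually consistent, which is the only design point this sketch is meant to show.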

What carries the argument

The intent-guided simulation framework that models both the desired influence and the execution plan of news creators to synthesize image-caption pairs reflecting real misleading intent.

If this is right

Models shift from relying on surface-level alignment and stylistic cues to implication-level intent reasoning.
The dataset functions as a diagnostic benchmark that reveals specific fragility points in current vision-language models.
The same framework operates as a scalable synthesis engine for producing high-quality intent-focused training resources.
Improved intent detection supports more effective real-world multimodal misinformation governance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The simulation approach could be extended to generate synthetic examples for other deceptive media formats such as video clips.
Intent-focused benchmarks may help prioritize which real-world items human moderators should examine first.
Explicit modeling of creator desire and plan might generalize to domains outside news, such as advertising or social media posts.

Load-bearing premise

The intent-guided simulation framework produces data that accurately reflects real-world misleading intent and serves as a reliable proxy for training and evaluation.

What would settle it

If models trained on DeceptionDecoded show no performance gain over standard baselines when tested on independently labeled real-world multimodal news for misleading intent, the claim that the framework supplies effective training resources would be falsified.
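The comparison described above could be run as a paired bootstrap over per-item correctness on the same real-world test set. The sketch below is one generic way to do this, not the paper's evaluation procedure; the function name and arguments are hypothetical.

```python
import random


def paired_bootstrap_gain(baseline_correct, tuned_correct, n_resamples=2000, seed=0):
    """Return (observed accuracy gain, 95% CI low, 95% CI high).

    Both inputs are per-item correctness flags (True/False) for the same
    test items, in the same order.
    """
    if len(baseline_correct) != len(tuned_correct) or not baseline_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(baseline_correct)
    # Per-item difference: +1 tuned model fixed it, -1 it broke it, 0 no change.
    diffs = [int(t) - int(b) for b, t in zip(baseline_correct, tuned_correct)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return observed, low, high
```

Under this reading, a confidence interval that includes zero would be consistent with the "no performance gain" outcome that would count against the claim.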

Original abstract

The impact of multimodal misinformation arises not only from factual inaccuracies but also from the misleading narratives that creators deliberately embed. Interpreting such creator intent is therefore essential for multimodal misinformation detection (MMD) and effective information governance. To this end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles, created using an intent-guided simulation framework that models both the desired influence and the execution plan of news creators. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities, and supports three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. We evaluate 14 state-of-the-art vision-language models (VLMs) and find that they struggle with intent reasoning, often relying on shallow cues such as surface-level alignment, stylistic polish, or heuristic authenticity signals. To bridge this, our framework systematically synthesizes data that enables models to learn implication-level intent reasoning. Models trained on DeceptionDecoded demonstrate strong transferability to real-world MMD, validating our framework as both a benchmark to diagnose VLM fragility and a data synthesis engine that provides high-quality, intent-focused resources for enhancing robustness in real-world multimodal misinformation governance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New benchmark for intent reasoning in multimodal misinformation via simulation, but the transfer claims hinge on unverified fidelity of the synthetic data to real creator behavior.


The paper introduces DeceptionDecoded, a 12,000-pair dataset of image-caption news items built through an intent-guided simulation that models both the creator's desired influence and execution plan. It defines three tasks—misleading intent detection, source attribution, and creator desire inference—and evaluates 14 VLMs, showing they often default to surface alignment or stylistic signals instead of deeper implication reasoning. Models fine-tuned on the data are reported to transfer to real-world multimodal misinformation detection.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeceptionDecoded, a benchmark of 12,000 image-caption pairs generated from trustworthy reference articles via an intent-guided simulation framework that explicitly models news creators' desired influence and execution plans. The dataset covers both misleading and non-misleading cases with visual and textual manipulations and defines three intent-centric tasks: misleading intent detection, misleading source attribution, and creator desire inference. Evaluations of 14 VLMs show reliance on shallow cues rather than implication-level reasoning; models fine-tuned on the synthetic data exhibit strong transfer to real-world multimodal misinformation detection (MMD), positioning the framework as both a diagnostic benchmark and a data-synthesis engine.

Significance. If the simulation framework produces data whose misleading strategies and intent distributions match those in authentic multimodal news, the work supplies a reproducible, intent-focused resource that could materially advance VLM robustness for misinformation governance. The scale of the dataset, the systematic coverage of manipulation types, and the transfer results constitute concrete strengths that would be valuable to the community even if further validation of fidelity is required.

major comments (2)

[Abstract and §5] The transferability claim in the abstract and §5 rests on DeceptionDecoded serving as a reliable proxy for real-world creator intent. The intent-guided simulation is described as modeling desired influence plus execution plan, yet no direct empirical grounding is provided (e.g., human annotation study or distributional comparison) showing that the generated pairs match authentic MMD examples in frequency of specific visual-textual misalignments or subtlety of desire inference. Without such validation, both the diagnostic benchmark and the data-synthesis claims are load-bearing assumptions rather than demonstrated results.
[§4] §4 (VLM evaluation) reports that models struggle with intent reasoning and rely on surface-level alignment or stylistic cues, but the quantitative metrics, exact prompting protocols, and per-task breakdowns that support this diagnosis are not detailed enough to assess whether the observed failures are robust or artifactual. This directly affects the justification for using the synthetic data as a training resource.

minor comments (2)

[Methods] Clarify in the methods section how the 12,000 pairs were sampled from reference articles and what safeguards prevent leakage of real-world MMD examples into the synthetic set.
[Tables and Figures] Figure captions and table legends should explicitly state the real-world MMD test set used for transfer experiments and the statistical tests applied to the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas for strengthening the validation of our simulation framework and the transparency of our experimental results. We address each major comment below and outline specific revisions to the manuscript.


Referee: [Abstract and §5] The transferability claim in the abstract and §5 rests on DeceptionDecoded serving as a reliable proxy for real-world creator intent. The intent-guided simulation is described as modeling desired influence plus execution plan, yet no direct empirical grounding is provided (e.g., human annotation study or distributional comparison) showing that the generated pairs match authentic MMD examples in frequency of specific visual-textual misalignments or subtlety of desire inference. Without such validation, both the diagnostic benchmark and the data-synthesis claims are load-bearing assumptions rather than demonstrated results.

Authors: We agree that direct empirical validation of distributional fidelity would strengthen the proxy claim. The current manuscript grounds the simulation in trustworthy reference articles and demonstrates utility via transfer results in §5, but does not include an explicit human annotation study or side-by-side distributional comparison of manipulation frequencies. To address this, the revised manuscript will add a new subsection (in §3) that reports (i) a distributional comparison of visual-textual misalignment types against a sample of real-world MMD instances drawn from established datasets and (ii) results from a small-scale human validation study assessing perceived subtlety of creator intent. These additions will be referenced in the abstract and §5 to support the transferability claims.
revision: yes
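A distributional comparison of misalignment types, as promised in this response, could take the form of a divergence between category frequencies in the synthetic and real samples. The sketch below uses Jensen-Shannon divergence as one possible measure; the choice of measure, the function name, and the category labels in the usage note are assumptions, not details from the paper or the rebuttal.

```python
import math
from collections import Counter


def js_divergence(labels_a, labels_b):
    """Jensen-Shannon divergence (base 2) between two lists of category labels.

    Returns 0.0 for identical frequency distributions and 1.0 for
    distributions with no category in common.
    """
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    if not counts_a or not counts_b:
        raise ValueError("both label lists must be non-empty")
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    jsd = 0.0
    for category in set(counts_a) | set(counts_b):
        p = counts_a[category] / total_a
        q = counts_b[category] / total_b
        m = (p + q) / 2  # mixture distribution; always > 0 for seen categories
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd
```

For example, `js_divergence(synthetic_types, real_types)` would take two lists of hypothetical per-item annotations such as "out-of-context image" or "exaggerated caption"; values near zero would indicate similar manipulation-type frequencies.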
Referee: [§4] §4 (VLM evaluation) reports that models struggle with intent reasoning and rely on surface-level alignment or stylistic cues, but the quantitative metrics, exact prompting protocols, and per-task breakdowns that support this diagnosis are not detailed enough to assess whether the observed failures are robust or artifactual. This directly affects the justification for using the synthetic data as a training resource.

Authors: We concur that greater detail is required for reproducibility and to confirm the robustness of the observed limitations. The revised §4 will be expanded to include: the complete quantitative metrics (accuracy, F1, and error rates with standard deviations across runs), the exact prompting templates and few-shot examples used for each of the three tasks, and full per-task and per-model breakdowns together with qualitative error analysis illustrating reliance on surface cues versus implication-level reasoning. These additions will make the diagnosis of VLM shortcomings more transparent and directly support the rationale for fine-tuning on DeceptionDecoded.
revision: yes
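The metrics named in this response (accuracy and F1, with standard deviations across runs) can be computed as in the sketch below for the binary detection task. It is a generic illustration with hypothetical function names, not the authors' evaluation code.

```python
from statistics import mean, stdev


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, where a truthy value means 'misleading'."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = [(bool(t), bool(p)) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # no positives anywhere: report 0.0
    return accuracy, f1


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric across repeated runs."""
    if not scores:
        raise ValueError("need at least one run")
    return mean(scores), (stdev(scores) if len(scores) > 1 else 0.0)
```

Reporting `summarize_runs` over several seeds or prompt variants for each task and model is one way to show whether the reported failures are stable rather than artifacts of a single run.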

Circularity Check

0 steps flagged

No significant circularity; new dataset synthesis and VLM evaluations are independent


The paper constructs DeceptionDecoded via an intent-guided simulation framework that models creator influence and execution plans, then evaluates 14 VLMs on three new intent-centric tasks and tests transfer to real-world MMD. These elements rely on original data generation and fresh empirical benchmarks rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. The simulation serves as an input generator whose outputs are externally tested, satisfying the criteria for a self-contained derivation against external benchmarks with no quoted reduction to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical benchmark and data synthesis paper. It relies on domain assumptions about simulation fidelity rather than free parameters or new invented entities.

axioms (1)

domain assumption: Trustworthy reference articles provide accurate ground truth for modeling creator intent and execution plans.
The dataset is grounded in trustworthy reference articles to create both misleading and non-misleading cases.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

Models trained on DeceptionDecoded demonstrate strong transferability to real-world MMD

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.