Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

Chang Chen; Jen-tse Huang; Mark Dredze; Michelle R. Kaufman; Shiyang Lai; Wenxuan Wang

Multimodal LLMs show uneven resistance to cognitive biases in Chinese short-video health misinformation: in the multimodal setting, Gemini-2.5-Pro reaches the highest belief score at 71.5/100 and o3 the lowest at 35.2.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-21 15:07 UTC pith:I4QXRJI5

Load-bearing objection: The paper adds a new annotated set of 200 Chinese health videos and tests MLLMs on visual plus social misinformation cues, but the belief scoring and annotation checks stay underspecified.

arxiv 2601.06600 v4 pith:I4QXRJI5 submitted 2026-01-10 cs.CL

Jen-tse Huang, Chang Chen, Shiyang Lai, Wenxuan Wang, Michelle R. Kaufman, Mark Dredze

classification cs.CL

keywords multimodal large language models, cognitive biases, misinformation, short videos, health claims, deceptive patterns, belief score, social cues

verification ladder: T0 review, T1 audit, T2 compute, T3 formal, T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an evaluation framework to measure how multimodal large language models respond to short videos that mix visual demonstrations with deceptive claims and social signals. It supplies a set of 200 manually labeled videos drawn from four health domains, each tagged for experimental errors, logical fallacies, or fabricated claims, with the labels verified against evidence such as national standards and academic literature. Tests across eight frontier models and five modality settings reveal large performance gaps between models and a susceptibility to social cues such as authoritative channel IDs. These findings matter because short-video platforms have become major channels for misinformation and models are increasingly used to interpret or filter such content.

Core claim

A dataset of 200 Chinese short videos spanning four health domains supplies fine-grained annotations for three deceptive patterns (experimental errors, logical fallacies, and fabricated claims), each checked against national standards and academic sources. When eight frontier multimodal LLMs are tested in five modality settings, Gemini-2.5-Pro records the highest belief score in the multimodal setting, 71.5 out of 100, while o3 records the lowest at 35.2, and the models prove vulnerable to social cues such as authoritative channel IDs that trigger false beliefs.

What carries the argument

The manually annotated dataset of 200 short videos with labels for deceptive patterns, together with a belief score that quantifies model resistance across modality settings.

Load-bearing premise

The manually annotated dataset of 200 short videos supplies accurate fine-grained labels for deceptive patterns verified by national standards and literature, and the belief score accurately reflects each model's susceptibility to cognitive biases.

What would settle it

Running the identical eight models on a fresh collection of 200 short videos that preserve the same distribution of deceptive patterns and social cues and obtaining a reversed model ranking or loss of susceptibility to channel IDs.

If this is right

Gemini-2.5-Pro maintains higher resistance than other models when given full video input.
Authoritative channel IDs reliably increase false belief rates across tested models.
Performance varies with the modality setting, from text-only to complete video.
The three deceptive patterns produce measurable differences in model belief scores.
The framework supplies a repeatable benchmark for tracking progress on video misinformation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models trained on longer-form content may need targeted fine-tuning on short, fast-paced video formats to reduce these biases.
Platforms that rely on MLLMs for moderation would still need separate checks for channel authority signals.
The same evaluation could be repeated on non-health topics to test whether bias patterns generalize.
Human viewers exposed to the same videos may exhibit parallel vulnerabilities, offering a comparison point for model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

3 major / 2 minor

Summary. The paper introduces a manually annotated dataset of 200 Chinese short videos across four health domains with fine-grained labels for three deceptive patterns (experimental errors, logical fallacies, fabricated claims) verified against national standards and literature. It evaluates eight frontier MLLMs across five modality settings using a belief score to quantify susceptibility to cognitive biases in misinformation, reporting Gemini-2.5-Pro at 71.5/100 (highest) and o3 at 35.2 (lowest) in the multimodal setting, while also analyzing social cues such as authoritative channel IDs that induce false beliefs.

Significance. If the measurement pipeline is validated, the work would offer timely empirical evidence on MLLM vulnerabilities to visually and socially entangled misinformation in short-video platforms, with direct relevance to public health communication and AI safety in non-English contexts.

major comments (3)

[Abstract and Evaluation Framework] Abstract and §4 (Evaluation): The headline model rankings rest on the belief score (e.g., 71.5 vs 35.2), yet the manuscript supplies no explicit definition, formula, or mapping from raw model outputs to the 0-100 scale, nor any inter-annotator agreement or statistical significance for the 200-video results; this directly undermines verification of the susceptibility claims.
[Dataset Construction] §3 (Dataset): The claim that annotations are 'fine-grained' and 'verified by national standards' is load-bearing for all downstream results, but no annotation protocol, annotator count, agreement metric, or validation procedure is described, leaving open the possibility that label noise could reverse the reported ordering between models.
[Social Cue Investigation] §5 (Social Cue Analysis): The finding that models are susceptible to biases like authoritative channel IDs inherits the same unverified belief-score pipeline; without details on how social cues are isolated or scored, the causal attribution to specific cues cannot be assessed.
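
The agreement metric requested in the comments above could take several forms; the report does not say which one the authors use. As a minimal sketch, assuming two annotators each assign one of the three deceptive-pattern labels to every video, Cohen's kappa can be computed as follows. The function name and the label strings are hypothetical and the example labels are invented for illustration, not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper: the paper's actual agreement metric is not specified.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for four videos (not real annotations).
annotator_1 = ["experimental", "experimental", "logical", "fabricated"]
annotator_2 = ["experimental", "logical", "logical", "fabricated"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, which matters here because an uneven distribution of the three patterns would inflate simple percent agreement.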

minor comments (2)

[Experimental Setup] Clarify the precise definitions of the five modality settings and how prompts differ across them.
[Results] Add a table summarizing per-model, per-modality belief scores with standard deviations or confidence intervals.
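
One way to produce the confidence intervals requested in the second minor comment is a percentile bootstrap over per-video belief scores. The sketch below assumes that each model yields one score per video on the 0-100 scale; the function name, the resampling settings, and the example scores are assumptions for illustration and do not come from the paper.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-video scores.

    Hypothetical helper; the paper does not describe its uncertainty estimates.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    # Resample videos with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Clamp the percentile indices so they always stay inside the list.
    lo_idx = max(0, min(n_resamples - 1, round(alpha / 2 * n_resamples)))
    hi_idx = max(0, min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1))
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


# Invented per-video scores for one model in one modality setting.
example_scores = [20.0, 35.0, 50.0, 65.0, 80.0, 40.0, 55.0, 70.0, 30.0, 60.0]
mean, lower, upper = bootstrap_ci(example_scores)
```

Reporting `mean` with `[lower, upper]` for every model and modality cell would show whether gaps between models exceed resampling noise on a 200-video set.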

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript to incorporate additional methodological details where the comments correctly identify gaps in the current version.

Referee: [Abstract and Evaluation Framework] Abstract and §4 (Evaluation): The headline model rankings rest on the belief score (e.g., 71.5 vs 35.2), yet the manuscript supplies no explicit definition, formula, or mapping from raw model outputs to the 0-100 scale, nor any inter-annotator agreement or statistical significance for the 200-video results; this directly undermines verification of the susceptibility claims.

Authors: We agree that an explicit definition and formula for the belief score are required for reproducibility and verification. In the revised manuscript we have expanded §4 to include the precise definition of the belief score, the formula mapping raw model outputs (e.g., assessed belief level or probability) to the 0-100 scale, inter-annotator agreement statistics for the underlying annotations, and statistical significance tests for the reported model differences. revision: yes
Referee: [Dataset Construction] §3 (Dataset): The claim that annotations are 'fine-grained' and 'verified by national standards' is load-bearing for all downstream results, but no annotation protocol, annotator count, agreement metric, or validation procedure is described, leaving open the possibility that label noise could reverse the reported ordering between models.

Authors: The referee is correct that the current description of the annotation process is incomplete. We have revised §3 to provide the full annotation protocol, the number of annotators, the inter-annotator agreement metric employed, and the specific validation steps used to cross-check labels against national standards and academic literature. revision: yes
Referee: [Social Cue Investigation] §5 (Social Cue Analysis): The finding that models are susceptible to biases like authoritative channel IDs inherits the same unverified belief-score pipeline; without details on how social cues are isolated or scored, the causal attribution to specific cues cannot be assessed.

Authors: We acknowledge that the social-cue analysis requires additional methodological transparency to support causal claims. In the revised §5 we have added explicit descriptions of how social cues (such as channel IDs) are isolated within the videos, the scoring procedure for their influence on model belief scores, and any controls applied to attribute effects to individual cues. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on new annotated dataset

The paper constructs a new dataset of 200 short videos with manual annotations for deceptive patterns, then directly evaluates eight MLLMs across modality settings to produce belief scores. No equations, fitted parameters, predictions derived from subsets, or self-citations are invoked to justify the core results. The reported scores (71.5 for Gemini-2.5-Pro, 35.2 for o3) are outputs of the evaluation pipeline rather than inputs renamed or forced by definition. The evaluation does not depend on external benchmarks and contains no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the accuracy of manual annotations for deceptive patterns and the validity of the belief score as a measure of bias susceptibility; these are domain assumptions not independently verified in the provided abstract.

axioms (1)

Domain assumption: Manual annotations by experts using national standards and academic literature accurately capture the three deceptive patterns in the videos.
Invoked when describing the dataset construction and verification process in the abstract.

pith-pipeline@v0.9.0 · 5715 in / 1350 out tokens · 65013 ms · 2026-05-21T15:07:58.251939+00:00 · methodology

Original abstract

Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns (experimental errors, logical fallacies, and fabricated claims), each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.

Figures

Figures reproduced from arXiv: 2601.06600 by Chang Chen, Jen-tse Huang, Mark Dredze, Michelle R. Kaufman, Shiyang Lai, Wenxuan Wang.

**Figure 1.** Overview of the data structure. Upper-left: The high-quality dataset consists of twelve fields. Upper-right: Videos from Douyin and Kuaishou are processed into visual, textual, and aural modalities, with histograms depicting token length distributions. Lower-left: Misinformation is annotated with detailed error reasons, supporting evidence, and error types. Lower-right: The dataset is categorized into four…

**Figure 2.** Belief Scores (BS) of eight models across three error types on the false video subset. Yellow “Score” lines…

**Figure 3.** Belief Scores (BS) of four models with dif…

**Figure 4.** The difference (r_with_ID − r_without_ID) rescaled by 100/3 across four verification statuses for eight models using the Claim setting and all video data. Results on the two subsets are provided in…

**Figure 5.** Average score decrease after using channel…

**Figure 7.** The [CLAIM] is replaced by the actual claim of each video.

**Figure 8.** The [TEXT] is replaced by the actual textual content of each video.

**Figure 9.** The [TRANSCRIPT] is replaced by the actual aural transcript of each video.

**Figure 10.** The [IMAGE] is replaced by the actual visual frames of each video.

**Figure 11.** The [TRANSCRIPT] and [IMAGE] are replaced by the actual aural transcripts and visual frames of each video.

**Figure 12.** The [REASONING] and [REASON] are replaced by the actual CoT output and the error reason for each video.

**Figure 13.** The [Corresponding Data Input] is replaced by actual input data along with (views, likes, shares, comments) popularity statistics. Prompt template (Popularity Effect): [Original Claim or Multimodal Prompt], [Corresponding Data Input], "This short video has received [A] views, [B] likes, [C] shares, and [D] comments".

**Figure 14.** The [Corresponding Data Input] is replaced by actual input data along with [CHANNEL ID]. Prompt template (Channel ID Effect): [Original Claim or Multimodal Prompt], [Corresponding Data Input], "This short video was uploaded by [CHANNEL ID]".
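
The social-cue templates and the Figure 4 quantity in the captions above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the cue sentences are copied from the Popularity Effect and Channel ID Effect templates, the rescaling factor in the Figure 4 caption is read as 100/3, and the ratings r are assumed to lie on a 0 to 3 scale (inferred from that factor, not confirmed by the source).

```python
def with_popularity(prompt, views, likes, shares, comments):
    """Append the popularity sentence from the Popularity Effect template."""
    return (
        f"{prompt}\nThis short video has received {views} views, "
        f"{likes} likes, {shares} shares, and {comments} comments"
    )


def with_channel_id(prompt, channel_id):
    """Append the uploader sentence from the Channel ID Effect template."""
    return f"{prompt}\nThis short video was uploaded by {channel_id}"


def rescaled_difference(r_with_id, r_without_id, max_rating=3):
    """(r_with_ID - r_without_ID) * 100 / 3, per the Figure 4 caption.

    max_rating=3 is an assumption inferred from the 100/3 factor.
    """
    return (r_with_id - r_without_id) * 100 / max_rating


# Placeholder prompt and channel name, invented for illustration.
base_prompt = "[Original Claim or Multimodal Prompt]"
cued_prompt = with_channel_id(base_prompt, "Example Health Channel")
popular_prompt = with_popularity(base_prompt, 1000, 200, 30, 4)
# A one-point shift on the assumed 0-3 scale maps to about 33.3 points.
shift = rescaled_difference(2, 1)
```

Comparing a model's rating on `cued_prompt` against its rating on `base_prompt` isolates the effect of the single appended sentence, which is the kind of control the third major comment asks the authors to document.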

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection
cs.CV · 2026-06 · unverdicted · novelty 7.0

EVID-Bench supplies 222 videos across nine manipulation types in three categories and shows that frontier multimodal models reach at most 61.43% point-level accuracy when forced to use web search to identify false inf...