
arxiv: 2603.22368 · v2 · submitted 2026-03-23 · 💻 cs.CV · cs.AI

When Visuals Aren't the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations

Pith reviewed 2026-05-15 00:58 UTC · model grok-4.3

keywords: vision-language models, misleading visualizations, data visualization, reasoning errors, visual design errors, misinformation detection, benchmark evaluation, chart understanding

The pith

Vision-language models detect visual design errors in misleading charts more reliably than reasoning errors in captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well vision-language models can spot misleading data visualizations, particularly when the deception comes from either flawed chart design or tricky reasoning in the accompanying captions. It introduces a benchmark using real visualizations paired with human-written misleading captions that target specific types of errors, such as cherry-picking data or using truncated axes. Models perform better at catching visual design mistakes like inappropriate encodings than at detecting reasoning flaws like false causal claims. They also tend to wrongly label straightforward, non-misleading visualizations as deceptive. This matters for deploying these models in tools that help interpret data, where missing or misattributing deception could allow misinformation to spread.

Core claim

The paper develops a benchmark of real-world visualizations paired with human-authored misleading captions that target a fine-grained taxonomy of reasoning errors such as cherry-picking and causal inference, and visualization design errors such as truncated axes and dual axes. Evaluation of commercial and open-source VLMs shows that models detect visual design errors substantially more reliably than reasoning-based misinformation and frequently misclassify non-misleading visualizations as deceptive. This establishes a measurable gap between coarse detection of misleading content and attribution of the specific error types that produce it.

What carries the argument

A benchmark combining real-world visualizations with human-authored misleading captions designed to elicit specific reasoning errors and visualization design errors from a fine-grained taxonomy.

If this is right

  • VLMs are substantially more reliable at detecting visualization design errors than reasoning-based misinformation.
  • Models often misclassify non-misleading visualizations as deceptive.
  • The benchmark enables controlled analysis across error categories and modalities of misleadingness.
  • This reveals a gap in current VLM capabilities for attributing specific errors that give rise to deception.
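
The first two implications come down to two separate quantities: a detection rate per error family, and a false-positive rate on non-misleading controls. The following is a minimal sketch of how they might be computed, not the paper's evaluation code. The record layout, the family names `design`, `reasoning`, and `none`, and every value in `records` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (error_family, gold_is_misleading, model_says_misleading).
# "none" marks a non-misleading control pair. Values are illustrative only
# and are NOT results from the paper.
records = [
    ("design", True, True),
    ("design", True, True),
    ("design", True, False),
    ("reasoning", True, True),
    ("reasoning", True, False),
    ("reasoning", True, False),
    ("none", False, True),
    ("none", False, False),
]


def detection_summary(rows):
    """Return (per-family detection rate, false-positive rate on controls)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    false_pos = 0
    controls = 0
    for family, gold, pred in rows:
        if gold:
            # Misleading pair: count whether the model flagged it.
            totals[family] += 1
            hits[family] += int(pred)
        else:
            # Control pair: any "misleading" verdict is a false positive.
            controls += 1
            false_pos += int(pred)
    recall = {fam: hits[fam] / totals[fam] for fam in totals}
    fpr = false_pos / controls if controls else float("nan")
    return recall, fpr


recall, fpr = detection_summary(records)
print("detection rate by family:", recall)
print("false-positive rate on controls:", fpr)
```

Keeping the control pairs out of the per-family rates matters here: a model that flags everything as misleading would score perfectly on detection while failing entirely on the non-misleading cases.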

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improving VLMs' ability to parse and critique reasoning in captions could raise their reliability in data journalism and automated fact-checking.
  • Hybrid systems that pair VLMs with separate reasoning modules might close the observed performance gap on caption-based deceptions.
  • Extending the benchmark to interactive or animated charts would test whether the design-versus-reasoning difference holds in dynamic settings.

Load-bearing premise

The human-authored misleading captions and the fine-grained taxonomy of reasoning and visualization errors accurately represent and cover the main real-world sources of deception in data visualizations.

What would settle it

Collecting a large set of naturally occurring misleading visualizations from the web without human-authored captions and finding no performance difference between visual design and reasoning error detection would falsify the central claim.
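
One way such a comparison could be run is a two-proportion z-test on the detection rates for the two error families. This is a sketch under assumptions: the source does not say which test, if any, the authors use, and the function name and the counts below are hypothetical. Note also that failing to reject the null is weaker than demonstrating "no difference"; an equivalence test would be needed for that stronger statement.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for equality of two detection rates.

    hits_a / n_a and hits_b / n_b are the observed detection rates,
    e.g. for design errors and reasoning errors respectively.
    Returns (z statistic, two-sided p-value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled rate under the null hypothesis that both rates are equal.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only (not results from the paper).
z, p = two_proportion_z(hits_a=80, n_a=100, hits_b=50, n_b=100)
print(f"z = {z:.3f}, p = {p:.5f}")
```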

Original abstract

Visualizations help communicate data insights, but deceptive data representations can distort their interpretation and propagate misinformation. While recent Vision Language Models (VLMs) perform well on many chart understanding tasks, their ability to detect misleading visualizations, especially when deception arises from subtle reasoning errors in captions, remains poorly understood. Here, we evaluate VLMs on misleading visualization-caption pairs grounded in a fine-grained taxonomy of reasoning errors (e.g., Cherry-picking, Causal inference) and visualization design errors (e.g., Truncated axis, Dual axis, inappropriate encodings). To this end, we develop a benchmark that combines real-world visualization with human-authored, curated misleading captions designed to elicit specific reasoning and visualization error types, enabling controlled analysis across error categories and modalities of misleadingness. Evaluating many commercial and open-source VLMs, we find that models detect visual design errors substantially more reliably than reasoning-based misinformation, and frequently misclassify non-misleading visualizations as deceptive. Overall, our work fills a gap between coarse detection of misleading content and the attribution of the specific reasoning or visualization errors that give rise to it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a benchmark of real-world visualizations paired with human-authored misleading captions, organized by a fine-grained taxonomy of reasoning errors (e.g., cherry-picking, causal inference) and visualization design errors (e.g., truncated axes, dual axes, inappropriate encodings). It evaluates multiple commercial and open-source VLMs on this benchmark and reports that models detect visual design errors substantially more reliably than reasoning-based misinformation while frequently misclassifying non-misleading visualizations as deceptive.

Significance. If the benchmark faithfully represents real-world deception, the work is significant for exposing a modality-specific limitation in current VLMs: stronger performance on surface-level visual cues than on subtle reasoning errors in captions. The controlled taxonomy and use of real visualizations enable precise attribution of errors by category, which is a clear advance over coarse misleading-content detection. The multi-model evaluation further strengthens the empirical contribution.

major comments (2)
  1. [Benchmark section] Benchmark construction: the human-authored captions and fixed taxonomy are presented without external validation (e.g., comparison to scraped news or social-media examples of deceptive visualizations). This is load-bearing for the central claim that the observed performance gap reflects model behavior on misleading data visualizations, because the synthetic instances may not match the subtlety, frequency, or visual-reasoning interactions that arise in the wild.
  2. [Experiments and Results] Results and evaluation: the manuscript does not report dataset statistics, inter-annotator agreement for caption curation, or complete per-model/per-category results tables. Without these, the robustness of the headline finding (visual errors detected more reliably than reasoning errors) cannot be fully assessed.
minor comments (1)
  1. [Abstract] The abstract and introduction could more explicitly state the total number of visualizations, the exact split between reasoning and visualization error instances, and the list of evaluated VLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

Point-by-point responses
  1. Referee: [Benchmark section] Benchmark construction: the human-authored captions and fixed taxonomy are presented without external validation (e.g., comparison to scraped news or social-media examples of deceptive visualizations). This is load-bearing for the central claim that the observed performance gap reflects model behavior on misleading data visualizations, because the synthetic instances may not match the subtlety, frequency, or visual-reasoning interactions that arise in the wild.

    Authors: We appreciate this point and agree that stronger grounding in real-world examples would enhance the benchmark. Our visualizations are sourced from real-world charts (e.g., from public datasets and news sources), and the taxonomy draws directly from established literature on misleading visualizations. However, we did not perform a systematic scrape of news or social media for validation in the current version. In the revision, we will add a new subsection detailing the caption curation process, include side-by-side comparisons with 10-15 real-world deceptive examples from news and social media to illustrate alignment, and explicitly discuss limitations regarding subtlety and interactions in the wild. This addresses the concern without overclaiming generalizability.

    Revision: partial

  2. Referee: [Experiments and Results] Results and evaluation: the manuscript does not report dataset statistics, inter-annotator agreement for caption curation, or complete per-model/per-category results tables. Without these, the robustness of the headline finding (visual errors detected more reliably than reasoning errors) cannot be fully assessed.

    Authors: We fully agree that these details are necessary for evaluating robustness. The current manuscript omitted them to focus on the main findings. In the revised version, we will add: (1) full dataset statistics including the number of instances per error category, split by reasoning vs. visualization errors, and total size; (2) inter-annotator agreement metrics (e.g., Cohen's kappa) for the caption curation process, which involved multiple annotators; and (3) complete per-model/per-category results tables with precision, recall, and F1 scores for all VLMs evaluated. These will appear in the main Experiments section with additional details in the appendix.

    Revision: yes
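
For readers unfamiliar with the metrics named in this exchange, the sketch below shows how Cohen's kappa between two annotators, and precision, recall, and F1 for a single error category, are conventionally computed. It is illustrative only. The function names, the label strings, and the toy annotations are assumptions and do not come from the paper or its benchmark.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall_f1(gold, pred, category):
    """One-vs-rest precision, recall, and F1 for a single category."""
    pairs = list(zip(gold, pred))
    tp = sum(g == category and p == category for g, p in pairs)
    fp = sum(g != category and p == category for g, p in pairs)
    fn = sum(g == category and p != category for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels, hypothetical and for illustration only.
annotator_1 = ["cherry-picking", "causal", "cherry-picking", "truncated-axis"]
annotator_2 = ["cherry-picking", "causal", "causal", "truncated-axis"]

print("kappa:", cohens_kappa(annotator_1, annotator_2))
print("cherry-picking P/R/F1:",
      precision_recall_f1(annotator_1, annotator_2, "cherry-picking"))
```

Reporting kappa alongside raw agreement is useful when categories are imbalanced, since raw agreement can be high purely by chance when one label dominates.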

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential predictions

Full rationale

The paper is an empirical evaluation study that constructs a benchmark of real visualizations paired with human-authored misleading captions based on a fixed taxonomy, then measures VLM performance against human labels. No equations, fitted parameters, or predictive models exist whose outputs reduce to inputs by construction. No self-citations are invoked to justify uniqueness theorems or ansatzes that bear the central claims. The results consist of direct accuracy comparisons across error categories, which are independent of any internal fitting or renaming of known results. The benchmark creation is explicitly described as human-curated without claiming external validation as part of a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the curated captions faithfully instantiate the defined error categories and that these categories comprehensively capture real deceptive practices; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The proposed taxonomy of reasoning errors (e.g., Cherry-picking, Causal inference) and visualization design errors (e.g., Truncated axis, Dual axis) covers the primary mechanisms of misleadingness in data visualizations.
    The benchmark is constructed around this taxonomy to enable controlled analysis.

pith-pipeline@v0.9.0 · 5506 in / 1193 out tokens · 52163 ms · 2026-05-15T00:58:17.938924+00:00 · methodology
