pith. machine review for the scientific record.

arxiv: 2604.12033 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI · cs.CV

Recognition: unknown

Benchmarking Deflection and Hallucination in Large Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.CV
keywords large vision-language models · deflection · hallucination · multimodal retrieval · benchmark · knowledge-intensive QA · conflicting evidence · retrieval-augmented generation

The pith

Large vision-language models usually fail to deflect when presented with noisy or misleading multimodal evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new benchmark and evaluation method to test how large vision-language models respond when retrieved visual and textual information is incomplete, conflicting, or insufficient for answering questions. Existing tests become outdated as models absorb more data into their parameters, and they do not measure whether models choose to deflect rather than generate unsupported answers. A dynamic curation process selects only questions that truly require external retrieval, and four distinct scenarios separate cases of pure memorization from cases where models must rely on the provided evidence. Tests on twenty current models show that deflection is rare, so the work argues that reliable evaluation of knowledge-intensive multimodal systems must track not only what models get right but how they behave when the evidence does not support an answer.
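For concreteness, here is a minimal sketch of how the four scenarios named in the figures (Parametric, Oracle, Realistic, Adversarial) could be instantiated for a single benchmark sample. The exact composition of each condition is inferred from the figure captions, and the class and field names are illustrative assumptions, not the paper's released code.

    from dataclasses import dataclass

    @dataclass
    class EvalCondition:
        """One evaluation condition for a single benchmark sample."""
        scenario: str        # Parametric, Oracle, Realistic, or Adversarial
        question: str
        contexts: list[str]  # retrieved textual/visual evidence (empty = none)

    def build_conditions(question, gold, distractors, misleading):
        """Instantiate the four scenarios for one question.

        Parametric:  no context, probing parametric memory alone.
        Oracle:      gold evidence only (0 distractors, per Figure 3).
        Realistic:   gold evidence mixed with retrieved distractors.
        Adversarial: misleading negative contexts only, where the
                     calibrated behavior is to deflect.
        """
        return [
            EvalCondition("Parametric", question, []),
            EvalCondition("Oracle", question, list(gold)),
            EvalCondition("Realistic", question, list(gold) + list(distractors)),
            EvalCondition("Adversarial", question, list(misleading)),
        ]

Under this reading, a model that answers correctly in the Parametric condition never needed retrieval at all, which is exactly what the curation pipeline filters against.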

Core claim

The central claim is that across twenty state-of-the-art large vision-language models, models usually fail to produce deflections such as 'Sorry, I cannot answer' when the retrieved evidence is noisy or misleading. This behavior is measured with VLM-DeflectionBench, a set of 2,775 samples that cover diverse multimodal retrieval settings, and with a fine-grained protocol of four scenarios that isolate parametric memorization from retrieval-dependent reasoning. A dynamic curation pipeline keeps the benchmark difficult by continuously filtering for questions that cannot be answered from model parameters alone.
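The paper scores responses with a model judge rather than string matching, but a minimal sketch makes the deflection-rate computation concrete. The phrase list below is a stand-in heuristic of our own, not the paper's judge; the three labels mirror the SIMPLEQA-style rubric (CORRECT, INCORRECT, NOT_ATTEMPTED) referenced in the paper's appendix figures.

    import re

    # Hypothetical deflection phrases; the paper's protocol uses an LLM judge.
    DEFLECTION_PATTERNS = [
        r"\bsorry,? i cannot answer\b",
        r"\bnot enough (information|evidence)\b",
        r"\bcannot be (answered|fulfilled)\b",
    ]

    def label(response: str, gold_answer: str) -> str:
        """Assign one of the three rubric labels to a model response."""
        text = response.lower()
        if any(re.search(p, text) for p in DEFLECTION_PATTERNS):
            return "NOT_ATTEMPTED"  # the response is a deflection
        return "CORRECT" if gold_answer.lower() in text else "INCORRECT"

    def deflection_rate(responses, gold_answers):
        labels = [label(r, g) for r, g in zip(responses, gold_answers)]
        return labels.count("NOT_ATTEMPTED") / len(labels)

On adversarial samples a well-calibrated model should push this rate high; the headline finding is that measured rates stay low.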

What carries the argument

VLM-DeflectionBench together with its four-scenario evaluation protocol, which separates memorized answers from responses that depend on the quality of retrieved visual and textual evidence.

Load-bearing premise

The dynamic curation pipeline correctly identifies questions that genuinely require retrieval and the four scenarios cleanly distinguish memorized knowledge from retrieval behavior.
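A rough sketch of what that filtering stage could look like, following the pipeline description in Figure 2 and the convergence behavior in Figure 5: any sample some gating model answers correctly without retrieved context is dropped, and gating models are added until the surviving set stops shrinking. The signatures and toy data are assumptions for illustration only.

    from typing import Callable, Iterable

    def parametric_filter(samples: list[dict],
                          gating_models: Iterable[Callable[[str], str]],
                          is_correct: Callable[[str, dict], bool],
                          tol: float = 0.01) -> list[dict]:
        """Drop samples answerable from parametric memory alone.

        Figure 5 reports convergence after four gating models, with over
        90% of candidate samples removed.
        """
        kept = list(samples)
        for model in gating_models:
            before = len(kept)
            kept = [s for s in kept if not is_correct(model(s["question"]), s)]
            if before and (before - len(kept)) / before < tol:
                break  # this model barely changed the set: converged
        return kept

    # Toy stand-ins: a "model" is any question -> answer callable.
    models = [lambda q: "Paris" if "capital of france" in q.lower() else "unsure"]
    samples = [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "Which car in these photos is taller?", "answer": "the Ford"},
    ]
    survivors = parametric_filter(
        samples, models,
        is_correct=lambda pred, s: s["answer"].lower() in pred.lower())
    print(len(survivors), "retrieval-dependent sample(s) survive")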

What would settle it

A new set of models that deflect correctly on the majority of the benchmark samples while still answering non-retrieval questions accurately would falsify the reported failure to deflect.

Figures

Figures reproduced from arXiv: 2604.12033 by Bill Byrne, Christopher Davis, Gonzalo Iglesias, Leonardo F. R. Ribeiro, Nicholas Moratelli.

Figure 1
Figure 1. Overview of VLM-DeflectionBench. Top: LVLMs often hallucinate instead of abstaining when context is misleading. Bottom: VLM-DeflectionBench evaluates calibration across four scenarios (Parametric, Oracle, Realistic, and Adversarial) to test whether models align their behavior with available knowledge. view at source ↗
Figure 2
Figure 2. Pipeline of VLM-DeflectionBench. Starting from 6 benchmarks, we apply parametric filtering to remove query-solvable samples (STAGE I), retrieve negative (multimodal) contexts via different indices (STAGE II), then perform oracle filtering to eliminate false-positive contexts (where gating models fail) and unreliable negative contexts (where gating models succeed) (STAGE III). view at source ↗
Figure 3
Figure 3. Effect of distractor quantity in the Realistic scenario, including Oracle results (0 distractors) for reference. Accuracy declines and hallucination rises as negatives increase, while deflection shows limited improvement. view at source ↗
Figure 4
Figure 4. Comparison of category distributions before and after curation. (Left) Aggregate distribution of the… view at source ↗
Figure 5
Figure 5. Performance on the Parametric scenario and dataset size as models are incrementally added to the filtering pipeline. Convergence is reached after four models, with over 90% of samples removed. view at source ↗
Figure 6
Figure 6. System prompt used in our RAG experiments showing the initial instructions given to the model. view at source ↗
Figure 7
Figure 7. Soft prompt variant encouraging models to answer based on available information, with optional deflection. view at source ↗
Figure 8
Figure 8. Moderate prompt variant instructing models to answer only when sufficiently confident in accuracy, with… view at source ↗
Figure 9
Figure 9. Severe prompt variant requiring models to answer only when completely certain, mandating deflection for… view at source ↗
Figure 10
Figure 10. Example from ENCYCLOPEDIC-VQA (Oracle). All models are given gold textual evidence about the coral snake’s diet. OVIS2-16B and CLAUDE-OPUS-4 ground their answers correctly, while MISTRAL-3.1-24B deflects despite sufficient information, illustrating conservative behavior under clear evidence. view at source ↗
Figure 11
Figure 11. Example from ENCYCLOPEDIC-VQA (Oracle). Gold text specifies the habitat; OVIS2-16B answers correctly while MISTRAL-3.1-24B and CLAUDE-OPUS-4 over-deflect under gold evidence. view at source ↗
Figure 12
Figure 12. Example from MRAG-BENCH (Oracle, visual gold). OVIS2-16B hallucinates a wrong car model; MISTRAL-3.1-24B and CLAUDE-OPUS-4 abstain. A grounding failure with gold images. view at source ↗
Figure 13
Figure 13. Example from MRAG-BENCH (Oracle, visual gold). MISTRAL-3.1-24B identifies the breed correctly; OVIS2-16B hallucinates; CLAUDE-OPUS-4 over-deflects. Calibration varies despite gold images. view at source ↗
Figure 14
Figure 14. Example from MMDOCRAG (Oracle). Gold text supports exact numeric extraction: OVIS2-16B is correct; MISTRAL-3.1-24B over-deflects; CLAUDE-OPUS-4 hallucinates outdated figures. view at source ↗
Figure 15
Figure 15. Example from WEBQA (Oracle, visual gold). Counting letters from the gold image: OVIS2-16B and CLAUDE-OPUS-4 hallucinate, MISTRAL-3.1-24B over-deflects. view at source ↗
Figure 16
Figure 16. Example from WEBQA (Oracle, visual gold). With two gold images, OVIS2-16B correctly infers the taller object; MISTRAL-3.1-24B and CLAUDE-OPUS-4 over-deflect. view at source ↗
read the original abstract

Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer...) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a dynamic data curation pipeline that maintains benchmark difficulty over time by selecting genuinely retrieval-dependent multimodal questions. It introduces VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse retrieval settings, and defines a fine-grained evaluation protocol with four scenarios intended to disentangle parametric memorization from retrieval behavior. Experiments on 20 state-of-the-art LVLMs show that models generally fail to generate deflections when presented with noisy, conflicting, or insufficient evidence.

Significance. If the curation pipeline and scenario separation are shown to be valid, the benchmark would offer a timely and reusable resource for evaluating LVLM reliability in knowledge-based visual QA. It foregrounds deflection behavior alongside accuracy and directly addresses the rapid obsolescence of static benchmarks caused by growth in model training data.

major comments (2)
  1. [dynamic data curation pipeline description] Dynamic data curation pipeline: The manuscript states that the pipeline filters for retrieval-dependent samples to preserve difficulty, yet provides no validation such as zero-shot accuracy on the curated questions without retrieved context, ablation comparing curated vs. non-curated sets, or metrics confirming low parametric contamination; this directly affects whether the central claim of retrieval-specific failure to deflect can be isolated from hallucination of memorized answers.
  2. [evaluation protocol and scenarios] Four scenarios and evaluation protocol: The scenarios are presented as cleanly disentangling parametric memorization from retrieval robustness, but the paper supplies no inter-scenario overlap statistics, confusion matrices, or ablation results demonstrating separation; without this, the fine-grained protocol's ability to support the reported findings on model behavior under noisy evidence remains unconfirmed.
minor comments (2)
  1. [abstract] The abstract and introduction would benefit from a brief table or paragraph summarizing the exact distribution of the 2,775 samples across the four scenarios, modalities, and evidence conflict types.
  2. [related work] Ensure all prior benchmarks on LVLM hallucination and deflection (e.g., those focused on visual QA or retrieval-augmented generation) are cited in the related work section for proper positioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

read point-by-point responses
  1. Referee: [dynamic data curation pipeline description] Dynamic data curation pipeline: The manuscript states that the pipeline filters for retrieval-dependent samples to preserve difficulty, yet provides no validation such as zero-shot accuracy on the curated questions without retrieved context, ablation comparing curated vs. non-curated sets, or metrics confirming low parametric contamination; this directly affects whether the central claim of retrieval-specific failure to deflect can be isolated from hallucination of memorized answers.

    Authors: We agree that additional validation would strengthen the isolation of retrieval-dependent behavior. In the revised manuscript, we will add zero-shot accuracy results on the curated questions without any retrieved context to demonstrate that models cannot answer them from parametric knowledge alone. We will also include an ablation comparing performance on the curated versus non-curated sets and report explicit metrics on parametric contamination, such as the fraction of questions answered correctly without retrieval. revision: yes

  2. Referee: [evaluation protocol and scenarios] Four scenarios and evaluation protocol: The scenarios are presented as cleanly disentangling parametric memorization from retrieval robustness, but the paper supplies no inter-scenario overlap statistics, confusion matrices, or ablation results demonstrating separation; without this, the fine-grained protocol's ability to support the reported findings on model behavior under noisy evidence remains unconfirmed.

    Authors: We acknowledge that empirical evidence of scenario separation would better support the protocol's validity. In the revision, we will provide inter-scenario overlap statistics, including sample distributions and any multi-label overlaps, along with confusion matrices or correlation analyses across the four scenarios. We will also add ablation results showing model performance differences across scenarios to confirm the protocol disentangles parametric memorization from retrieval robustness. revision: yes
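For intuition, the overlap statistics promised above could be reported as a scenario-by-label cross-table like the sketch below. The counts here are invented for illustration; the revision would populate the table from the actual judged runs.

    from collections import Counter

    # Hypothetical (scenario, judge label) pairs, one per evaluated response.
    runs = [
        ("Parametric", "INCORRECT"), ("Parametric", "NOT_ATTEMPTED"),
        ("Oracle", "CORRECT"), ("Oracle", "CORRECT"),
        ("Realistic", "CORRECT"), ("Realistic", "INCORRECT"),
        ("Adversarial", "INCORRECT"), ("Adversarial", "NOT_ATTEMPTED"),
    ]

    def scenario_label_table(runs):
        """Cross-tabulate judge labels by scenario. Well-separated scenarios
        show distinct profiles: Oracle dominated by CORRECT, Adversarial
        (ideally) by NOT_ATTEMPTED."""
        counts = Counter(runs)
        scenarios = sorted({s for s, _ in runs})
        labels = sorted({l for _, l in runs})
        rows = ["".ljust(12) + "".join(l.rjust(15) for l in labels)]
        for s in scenarios:
            rows.append(s.ljust(12) +
                        "".join(str(counts[(s, l)]).rjust(15) for l in labels))
        return "\n".join(rows)

    print(scenario_label_table(runs))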

Circularity Check

0 steps flagged

No circularity: empirical benchmark with procedural definitions and external validation

full rationale

The manuscript introduces VLM-DeflectionBench via a dynamic curation pipeline and four evaluation scenarios, but contains no derivations, equations, fitted parameters renamed as predictions, or self-referential definitions. Claims rest on experiments across 20 external LVLMs rather than reducing to the pipeline's own outputs by construction. The curation filter and disentanglement protocol are described as procedural filters without tautological closure (e.g., no claim that retrieval-dependence is proven solely by the filter's application). This is a standard self-contained empirical contribution whose central results are falsifiable against held-out models and data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard domain assumptions about LVLM usage rather than new theoretical entities or fitted parameters.

axioms (2)
  • domain assumption LVLMs increasingly rely on retrieval to answer knowledge-intensive multimodal questions.
    Opening premise of the abstract that motivates the need for deflection benchmarks.
  • domain assumption Existing benchmarks overlook conflicts between visual and textual evidence and suffer from rapid obsolescence.
    Stated justification for creating a new benchmark.

pith-pipeline@v0.9.0 · 5522 in / 1306 out tokens · 44578 ms · 2026-05-10T15:23:37.181176+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

cs.CV · 2026-05 · unverdicted · novelty 6.0

    MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.

Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923. Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. 2024a. The revolution of multimodal large language models: A survey. In Proceedings of the Annual Meeting of the Association for Computational Linguistics...

  2. [2]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2025. Augmenting multimodal LLMs with self-reflective tokens for knowledge-based visual...

  3. [3]

In Proceedings of the Conference on Empirical Methods in Natural Language Processing

SnapNTell: Enhancing entity-centric visual question answering with retrieval-augmented multimodal LLM. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. A-OKVQA: A benchmark for visual question answering using world know...

  4. [4]

EVA-CLIP-18B: Scaling CLIP to 18 billion parameters. arXiv:2402.04252, 2024

Garage: A benchmark with grounding annotations for RAG evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Xin Su, Man Luo, Kris W Pan, Tien Pei Chou, Vasudev Lal, and Phillip Howard. 2025. SK-VQA: Synthetic knowledge generation at scale for training context-augmented multimodal LLMs. In Proceedings of the I...
