Benchmarking Deflection and Hallucination in Large Vision-Language Models
Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3
The pith
Large vision-language models usually fail to deflect when presented with noisy or misleading multimodal evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that across twenty state-of-the-art large vision-language models, models usually fail to produce deflections such as 'Sorry, I cannot answer' when the retrieved evidence is noisy or misleading. This behavior is measured with VLM-DeflectionBench, a set of 2,775 samples that cover diverse multimodal retrieval settings, and with a fine-grained protocol of four scenarios that isolate parametric memorization from retrieval-dependent reasoning. A dynamic curation pipeline keeps the benchmark difficult by continuously filtering for questions that cannot be answered from model parameters alone.
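The deflection behavior probed here can be operationalized with a simple response classifier. The sketch below is an illustrative assumption, not the paper's published judging code; the phrase patterns and function names are invented for exposition.

```python
import re

# Hypothetical deflection patterns -- the paper only gives the example
# "Sorry, I cannot answer", so the remaining phrases are assumptions.
DEFLECTION_PATTERNS = [
    r"\bsorry,? i (?:cannot|can't) answer\b",
    r"\bi (?:do not|don't) have enough (?:information|evidence)\b",
    r"\bunable to (?:answer|determine)\b",
]

def is_deflection(response: str) -> bool:
    """Return True if the model response reads as a refusal/deflection."""
    text = response.lower()
    return any(re.search(p, text) for p in DEFLECTION_PATTERNS)

def deflection_rate(responses: list[str]) -> float:
    """Fraction of responses that deflect; the paper reports this is low
    for most of the 20 evaluated LVLMs under noisy or misleading evidence."""
    if not responses:
        return 0.0
    return sum(is_deflection(r) for r in responses) / len(responses)
```

A pattern-based judge like this is cheap but brittle; the benchmark's actual protocol may use an LLM judge or exact-match labels instead.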
What carries the argument
VLM-DeflectionBench together with its four-scenario evaluation protocol, which separates memorized answers from responses that depend on the quality of retrieved visual and textual evidence.
Load-bearing premise
The dynamic curation pipeline correctly identifies questions that genuinely require retrieval and the four scenarios cleanly distinguish memorized knowledge from retrieval behavior.
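A minimal sketch of the kind of filter such a pipeline could apply, assuming each candidate model is exposed as a plain callable from question to answer (these names and the containment-based correctness check are assumptions, not the paper's implementation): keep only questions that no model answers correctly zero-shot, i.e., without retrieved evidence.

```python
from typing import Callable

def retrieval_dependent(
    question: str,
    gold_answer: str,
    models: list[Callable[[str], str]],
) -> bool:
    """True if no model produces the gold answer without retrieved context."""
    return not any(
        gold_answer.lower() in model(question).lower() for model in models
    )

def curate(
    samples: list[tuple[str, str]],
    models: list[Callable[[str], str]],
) -> list[tuple[str, str]]:
    """Drop samples answerable from parameters alone, preserving difficulty."""
    return [(q, a) for q, a in samples if retrieval_dependent(q, a, models)]
```

Rerunning this filter as newer, larger-trained models are added is what would make the benchmark "dynamic" in the sense the abstract describes.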
What would settle it
Models that deflect on the majority of benchmark samples with noisy or misleading evidence, while still answering non-retrieval questions accurately, would falsify the reported failure to deflect.
Original abstract
Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer...) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dynamic data curation pipeline that maintains benchmark difficulty over time by selecting genuinely retrieval-dependent multimodal questions. It introduces VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse retrieval settings, and defines a fine-grained evaluation protocol with four scenarios intended to disentangle parametric memorization from retrieval behavior. Experiments on 20 state-of-the-art LVLMs show that models generally fail to generate deflections when presented with noisy, conflicting, or insufficient evidence.
Significance. If the curation pipeline and scenario separation are shown to be valid, the benchmark would offer a timely and reusable resource for evaluating LVLM reliability in knowledge-based visual QA, emphasizing the importance of deflection behaviors alongside accuracy and directly addressing the rapid obsolescence of static benchmarks due to model training data growth.
Major comments (2)
- [dynamic data curation pipeline description] Dynamic data curation pipeline: The manuscript states that the pipeline filters for retrieval-dependent samples to preserve difficulty, yet provides no validation such as zero-shot accuracy on the curated questions without retrieved context, ablation comparing curated vs. non-curated sets, or metrics confirming low parametric contamination; this directly affects whether the central claim of retrieval-specific failure to deflect can be isolated from hallucination of memorized answers.
- [evaluation protocol and scenarios] Four scenarios and evaluation protocol: The scenarios are presented as cleanly disentangling parametric memorization from retrieval robustness, but the paper supplies no inter-scenario overlap statistics, confusion matrices, or ablation results demonstrating separation; without this, the fine-grained protocol's ability to support the reported findings on model behavior under noisy evidence remains unconfirmed.
Minor comments (2)
- [abstract] The abstract and introduction would benefit from a brief table or paragraph summarizing the exact distribution of the 2,775 samples across the four scenarios, modalities, and evidence conflict types.
- [related work] Ensure all prior benchmarks on LVLM hallucination and deflection (e.g., those focused on visual QA or retrieval-augmented generation) are cited in the related work section for proper positioning.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.
Point-by-point responses
Referee: [dynamic data curation pipeline description] Dynamic data curation pipeline: The manuscript states that the pipeline filters for retrieval-dependent samples to preserve difficulty, yet provides no validation such as zero-shot accuracy on the curated questions without retrieved context, ablation comparing curated vs. non-curated sets, or metrics confirming low parametric contamination; this directly affects whether the central claim of retrieval-specific failure to deflect can be isolated from hallucination of memorized answers.
Authors: We agree that additional validation would strengthen the isolation of retrieval-dependent behavior. In the revised manuscript, we will add zero-shot accuracy results on the curated questions without any retrieved context to demonstrate that models cannot answer them from parametric knowledge alone. We will also include an ablation comparing performance on the curated versus non-curated sets and report explicit metrics on parametric contamination, such as the fraction of questions answered correctly without retrieval.
Revision planned: yes
Referee: [evaluation protocol and scenarios] Four scenarios and evaluation protocol: The scenarios are presented as cleanly disentangling parametric memorization from retrieval robustness, but the paper supplies no inter-scenario overlap statistics, confusion matrices, or ablation results demonstrating separation; without this, the fine-grained protocol's ability to support the reported findings on model behavior under noisy evidence remains unconfirmed.
Authors: We acknowledge that empirical evidence of scenario separation would better support the protocol's validity. In the revision, we will provide inter-scenario overlap statistics, including sample distributions and any multi-label overlaps, along with confusion matrices or correlation analyses across the four scenarios. We will also add ablation results showing model performance differences across scenarios to confirm the protocol disentangles parametric memorization from retrieval robustness.
Revision planned: yes
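The parametric-contamination metric promised in the rebuttal could be sketched as follows; the function name, data shapes, and containment-based correctness check are illustrative assumptions, not the authors' protocol.

```python
def parametric_contamination(
    zero_shot_predictions: dict[str, str],
    gold_answers: dict[str, str],
) -> float:
    """Fraction of curated questions a model answers correctly with NO
    retrieved context. Low values would support the claim that failures
    to deflect are retrieval-specific rather than memorization artifacts."""
    if not gold_answers:
        return 0.0
    hits = sum(
        gold_answers[qid].lower() in zero_shot_predictions.get(qid, "").lower()
        for qid in gold_answers
    )
    return hits / len(gold_answers)
```

Reporting this number per scenario would also give the inter-scenario separation evidence the referee's second comment asks for.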
Circularity Check
No circularity: empirical benchmark with procedural definitions and external validation
Full rationale
The manuscript introduces VLM-DeflectionBench via a dynamic curation pipeline and four evaluation scenarios, but contains no derivations, equations, fitted parameters renamed as predictions, or self-referential definitions. Claims rest on experiments across 20 external LVLMs rather than reducing to the pipeline's own outputs by construction. The curation filter and disentanglement protocol are described as procedural filters without tautological closure (e.g., no claim that retrieval-dependence is proven solely by the filter's application). This is a standard self-contained empirical contribution whose central results are falsifiable against held-out models and data.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: LVLMs increasingly rely on retrieval to answer knowledge-intensive multimodal questions.
- Domain assumption: Existing benchmarks overlook conflicts between visual and textual evidence and suffer from rapid obsolescence.
Forward citations
Cited by 1 Pith paper
- MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence. MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.