Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations
Pith reviewed 2026-05-18 12:54 UTC · model grok-4.3
The pith
Quality scores for vision-language model explanations let users judge prediction accuracy without seeing the image.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Visual Fidelity and Contrastiveness scores provide a more reliable signal of whether a vision-language model prediction is correct than existing explanation quality measures, and that presenting these scores to users who cannot see the image improves their ability to detect incorrect predictions.
What carries the argument
Visual Fidelity and Contrastiveness scoring functions, which respectively measure how faithful an explanation is to the visual context and how distinctly it identifies features that separate the model's prediction from alternatives.
If this is right
- The scores correlate more strongly with model correctness than prior explanation quality measures on A-OKVQA, VizWiz, and MMMU-Pro.
- Users reach 11.1 percent higher accuracy when deciding if a VLM prediction is correct after seeing the scores.
- The rate at which users accept incorrect VLM predictions drops by 15.4 percent when scores are displayed.
- Displaying the scores promotes more appropriate reliance on VLM predictions for users without visual access.
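The two user-study figures in the list above can be read as simple rates over participant judgments. The following is a minimal Python sketch, assuming each trial records whether the VLM was actually correct and whether the participant accepted the prediction. The `Trial` type, its field names, and the sample data are illustrative and do not come from the paper, whose exact metric definitions may differ.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    vlm_correct: bool   # ground truth: was the VLM prediction right?
    user_accepts: bool  # did the participant judge the prediction to be right?

def judgment_accuracy(trials):
    """Fraction of trials where the participant's judgment matches the truth."""
    return sum(t.vlm_correct == t.user_accepts for t in trials) / len(trials)

def false_acceptance_rate(trials):
    """Among incorrect VLM predictions, the fraction the participant accepted."""
    wrong = [t for t in trials if not t.vlm_correct]
    return sum(t.user_accepts for t in wrong) / len(wrong)

# Made-up trials for illustration only.
trials = [
    Trial(True, True), Trial(True, False),
    Trial(False, True), Trial(False, False),
    Trial(False, False),
]
print(judgment_accuracy(trials))      # 3 of 5 judgments match the truth
print(false_acceptance_rate(trials))  # 1 of 3 incorrect predictions accepted
```

Comparing these two rates between a condition with scores and a condition without them is one way to arrive at figures of the kind reported above.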
Where Pith is reading between the lines
- Interface designers for image-description tools could embed these scores to let blind users decide when to trust or question an output.
- Model developers might use the scores during training or evaluation to reward explanations that are both faithful and distinctive.
- The same scoring approach could be tested on other multimodal models beyond the three tasks studied here to check broader applicability.
Load-bearing premise
The Visual Fidelity and Contrastiveness scoring functions can be computed reliably from VLM outputs and visual context, and the user study participants' behavior generalizes to target populations such as blind and low-vision users who rely on explanations alone.
What would settle it
A follow-up user study with blind and low-vision participants that finds no accuracy gain when quality scores are shown alongside explanations would falsify the practical benefit.
Original abstract
When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA, VizWiz, and MMMU-Pro tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
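The abstract says the scores are "better calibrated with model correctness" but this page does not state which calibration metric is used. As an illustration only, the sketch below computes expected calibration error (ECE), one common choice, under the assumption that a quality score in [0, 1] is treated as a confidence value. The function name, binning scheme, and sample data are assumptions, not the paper's method.

```python
def expected_calibration_error(scores, correct, n_bins=10):
    """ECE: weighted mean gap between average score and accuracy per bin.

    scores  -- quality scores in [0, 1], treated as confidence (an assumption)
    correct -- 1 if the VLM prediction was right, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for s, c in zip(scores, correct):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, c))
    total = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_score = sum(s for s, _ in b) / len(b)
        accuracy = sum(c for _, c in b) / len(b)
        ece += (len(b) / total) * abs(mean_score - accuracy)
    return ece

# Made-up data for illustration only.
scores = [0.9, 0.9, 0.1, 0.1]
correct = [1, 1, 0, 0]
print(expected_calibration_error(scores, correct))
```

A lower value means the score tracks correctness more closely. Under this reading, a scoring function is "better calibrated" than a baseline when its ECE on the same instances is smaller.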
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes two complementary quality scoring functions for VLM explanations—Visual Fidelity (measuring faithfulness to visual context) and Contrastiveness (measuring distinction from plausible alternatives)—to help users without visual access assess prediction reliability. These scores are evaluated on A-OKVQA, VizWiz, and MMMU-Pro, where they are reported to be better calibrated with model correctness than prior explanation qualities. A user study shows that displaying the scores alongside explanations improves participants' accuracy at judging VLM correctness by 11.1% and reduces false belief in incorrect predictions by 15.4% in a no-visual-context setting.
Significance. If the calibration and user-study results hold, the work provides a practical mechanism to reduce overreliance on VLMs for accessibility-critical users. The multi-task benchmark evaluation and controlled user study constitute concrete, falsifiable evidence that explanation quality scores can improve appropriate reliance; this is a strength relative to purely qualitative prior work on VLM explanations.
major comments (3)
- [User Study] User Study section: the headline 11.1% accuracy gain and 15.4% false-belief reduction are measured with participants who are not described as belonging to the target population of blind or low-vision users. Because the study withholds images from (presumably sighted) participants, it is unclear whether the observed effect sizes would replicate for users whose only information is the explanation plus scores; this is load-bearing for the central claim of utility for the intended users.
- [§3] Abstract and §3 (Quality Scoring Functions): the manuscript reports that the proposed scores are “better calibrated” than baselines, yet provides no explicit formulas, pseudocode, or parameter definitions for Visual Fidelity and Contrastiveness, nor any statistical tests (e.g., confidence intervals or p-values) for the 11.1% and 15.4% figures. Without these details the calibration and user-study claims cannot be fully verified or reproduced.
- [Evaluation] Evaluation sections: no ablation or control is reported for potential confounds such as explanation length, lexical overlap with the prediction, or participant fatigue in the user study. These omissions weaken the attribution of the accuracy improvement specifically to the proposed quality scores.
minor comments (2)
- [Abstract] The abstract would be clearer if it briefly stated how the two scoring functions are computed from VLM outputs and visual context.
- Ensure that any tables reporting calibration metrics include the exact number of instances per task and the baseline methods used for comparison.
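The confidence intervals requested in the second major comment could be obtained in several ways. The sketch below uses a percentile bootstrap on the difference in judgment accuracy between two conditions, assuming independent per-trial outcomes in each condition. For a within-subject design, resampling participants rather than trials would be more appropriate. The function name and the data are hypothetical and are not taken from the paper.

```python
import random

def bootstrap_diff_ci(with_scores, without_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in mean judgment accuracy."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each condition with replacement, keeping its original size.
        a = [rng.choice(with_scores) for _ in with_scores]
        b = [rng.choice(without_scores) for _ in without_scores]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up data for illustration only: 1 = participant judged correctly.
with_scores = [1] * 70 + [0] * 30
without_scores = [1] * 60 + [0] * 40
lo, hi = bootstrap_diff_ci(with_scores, without_scores)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support the claim that showing the scores changes accuracy. An interval that includes zero would leave the direction of the effect unresolved at that sample size.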
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below, indicating where revisions will be made.
- Referee: [User Study] User Study section: the headline 11.1% accuracy gain and 15.4% false-belief reduction are measured with participants who are not described as belonging to the target population of blind or low-vision users. Because the study withholds images from (presumably sighted) participants, it is unclear whether the observed effect sizes would replicate for users whose only information is the explanation plus scores; this is load-bearing for the central claim of utility for the intended users.
  Authors: We recognize the value of testing with the target user population. Our study intentionally used a no-visual-context paradigm with sighted participants to control for variables and directly assess the impact of quality scores when visual information is unavailable, which is the core scenario for blind and low-vision users. We will revise the manuscript to explicitly discuss this as a limitation and to emphasize that the results provide evidence for the mechanism in the relevant information setting. Future studies with actual blind or low-vision participants are planned as follow-up work.
  Revision: partial
- Referee: [§3] Abstract and §3 (Quality Scoring Functions): the manuscript reports that the proposed scores are “better calibrated” than baselines, yet provides no explicit formulas, pseudocode, or parameter definitions for Visual Fidelity and Contrastiveness, nor any statistical tests (e.g., confidence intervals or p-values) for the 11.1% and 15.4% figures. Without these details the calibration and user-study claims cannot be fully verified or reproduced.
  Authors: We apologize for any ambiguity in the presentation. The full manuscript includes definitions in Section 3, but to improve reproducibility, we will add explicit mathematical formulas, pseudocode, and parameter details for Visual Fidelity and Contrastiveness in the revised version. We will also include statistical tests, such as confidence intervals and p-values, for the reported accuracy gains and false-belief reductions from the user study.
  Revision: yes
- Referee: [Evaluation] Evaluation sections: no ablation or control is reported for potential confounds such as explanation length, lexical overlap with the prediction, or participant fatigue in the user study. These omissions weaken the attribution of the accuracy improvement specifically to the proposed quality scores.
  Authors: We agree that ruling out confounds strengthens the conclusions. In the revised manuscript, we will incorporate additional analyses to control for explanation length and lexical overlap, for example by reporting correlations or partial correlations with these factors. Regarding participant fatigue, we will provide more details on the experimental design, including counterbalancing and session structure, and any post-hoc checks for order effects.
  Revision: yes
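The partial-correlation control mentioned in the last response can be illustrated with a short Python sketch. It uses the standard first-order partial correlation formula to ask how strongly a quality score relates to correctness once explanation length is linearly accounted for. The variable names and values are made up for illustration and are not results from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Made-up values for illustration only.
quality = [0.2, 0.4, 0.5, 0.7, 0.9, 0.3]  # hypothetical quality scores
correct = [0, 0, 1, 1, 1, 0]              # 1 = VLM prediction was right
length = [30, 45, 28, 60, 52, 41]         # explanation length in words
print(round(partial_corr(quality, correct, length), 3))
```

If the partial correlation stays close to the raw correlation, length is unlikely to explain the relationship. A large drop would suggest the score is partly acting as a proxy for length.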
Circularity Check
No significant circularity; scoring functions and user-study gains are independently defined and measured
The paper defines Visual Fidelity and Contrastiveness scoring functions directly from VLM outputs and visual context without fitting them to the evaluation outcomes. These scores are then assessed for calibration on held-out tasks (A-OKVQA, VizWiz, MMMU-Pro) and their utility is measured via a separate user study reporting an 11.1% accuracy improvement. No derivation step reduces the proposed scores or the reported gains to quantities that are fitted from or defined in terms of the same data; the chain remains self-contained with external empirical validation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Explanation quality can be decomposed into visual fidelity and contrastiveness dimensions that are measurable from model outputs and image context.
Reference graph
Works this paper leans on
[4]
Answer Only: Participants viewed only the question, answer choices (if available), and model prediction
-
[5]
With Explanation: Participants were provided with AI-generated rationales alongside pre- dictions
-
[6]
Visual Fidelity and Contrastiveness) were displayed alongside explanations
With Explanation + Quality: Qualities (varied from our experiment settings, e.g. Visual Fidelity and Contrastiveness) were displayed alongside explanations. This three-stage design of the user study enables us to track how users’ confidence in the model’s correctness evolves as they receive additional information. Timed Stages in Supplementary Human Studi...
- [7] Answer Only: fixed 5 seconds
- [8] With Explanation: explanation reading time (words / 238 wpm) (roughly 10–40 seconds)
- [9] Explanation + Quality: fixed 5 seconds
Bonus Payments in Supplementary Human Studies: Participants were paid a $2 base fee and could earn up to $1 in performance-based bonuses, which were awarded only during Stage 3 (Explanation + Quality; see Section G.3.2).
Table 16 shows that as users progress from seeing only the model’s answer to viewing explanation...
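The stage timings listed above reduce to a small calculation: two fixed 5-second stages and a reading-time stage of word count divided by 238 words per minute. The following is a minimal Python sketch, assuming whitespace tokenization for the word count. The function names are illustrative, and whether the study clamps the reading time to the stated 10–40 second range is not shown here, so no clamping is applied.

```python
WORDS_PER_MINUTE = 238  # reading rate quoted in the timing description

def explanation_stage_seconds(explanation: str) -> float:
    """Time allotted to read an explanation: word count / 238 wpm, in seconds."""
    n_words = len(explanation.split())  # whitespace tokenization (an assumption)
    return n_words / WORDS_PER_MINUTE * 60

def stage_durations(explanation: str) -> dict:
    """Seconds allotted to each of the three timed stages."""
    return {
        "answer_only": 5.0,
        "with_explanation": explanation_stage_seconds(explanation),
        "explanation_plus_quality": 5.0,
    }

# A 119-word explanation is half of 238 words, so it gets half a minute.
example = " ".join(["word"] * 119)
print(stage_durations(example))
```

At this rate, the stated range of roughly 10–40 seconds corresponds to explanations of about 40 to 160 words.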
- [10] Is the person wearing a helmet while riding a bicycle? Reason: This question is directly answerable by observing whether the person on the bicycle is wearing a helmet in the image.
- [11] Is the street in the image busy with traffic? Reason: This question can be visually verified by looking at the amount of traffic on the street in the image.
Bad Questions:
- [12] Is the person wearing the helmet because they are concerned about head injuries? Reason: This question is not good because it assumes the person’s intentions or concerns, which cannot be visually verified from the image.
- [13] Does wearing a helmet suggest that the person is highly safety-conscious? Reason: This question relies on inference and external knowledge about the person’s mindset, rather than on observable details from the image.
- [14] Is there any indication that the person is wearing a helmet for safety reasons? Reason: This question verifies the answer to the original question, rather than verifying a detail about the image that’s mentioned in the rationale.
- [15] Is the person wearing a safety vest? Reason: This question is not good because it tries to verify details about the image that are not explicitly mentioned in the rationale.
- [16] Does the explanation provide evidence that matches with the answer it gives?
  Is the person not wearing sunglasses? Reason: This question is not good because it asks for verification by absence and can only be answered with a "no," which is not the preferred type of question.
Respond with a list of (good) questions (without the reasons), starting from ‘1. ’
Table 7: Model configuration and prompt used to verify the visual questi...
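The prompt excerpt above asks the model to reply with a numbered list of verification questions starting from "1. ". The following Python sketch shows one way such a reply might be parsed. The helper `fraction_verified` is a hypothetical aggregate (the share of questions answered "yes") and is an assumption: this page does not show how the paper turns verification answers into a Visual Fidelity score.

```python
import re

def parse_numbered_questions(response: str) -> list[str]:
    """Extract questions from a reply formatted as '1. ...', '2. ...', etc."""
    questions = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+\.\s+(.*\S)", line)
        if match:
            questions.append(match.group(1))
    return questions

def fraction_verified(answers: list[str]) -> float:
    """Hypothetical aggregate: share of verification questions answered 'yes'."""
    if not answers:
        return 0.0
    return sum(a.strip().lower() == "yes" for a in answers) / len(answers)

# Sample reply built from the two good questions quoted in the prompt.
reply = """1. Is the person wearing a helmet while riding a bicycle?
2. Is the street in the image busy with traffic?"""
print(parse_numbered_questions(reply))
print(fraction_verified(["yes", "no"]))
```

Lines that do not start with a number and a period are skipped, so any preamble the model adds before the list is ignored.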