Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations

Brihi Joshi; Jesse Thomason; Keyu He; Swabha Swayamdipta; Tejas Srinivasan; Xiang Ren

arxiv: 2509.25844 · v3 · submitted 2025-09-30 · 💻 cs.CL · cs.HC

Keyu He, Tejas Srinivasan, Brihi Joshi, Xiang Ren, Jesse Thomason, Swabha Swayamdipta

Pith reviewed 2026-05-18 12:54 UTC · model grok-4.3

keywords vision-language models · explanation quality · visual fidelity · contrastiveness · user reliance · model correctness · A-OKVQA · VizWiz

The pith

Quality scores for vision-language model explanations let users judge prediction accuracy without seeing the image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two scoring functions called Visual Fidelity and Contrastiveness to rate the quality of explanations from vision-language models. Visual Fidelity checks how closely an explanation matches the actual visual input, while Contrastiveness checks how well it highlights details that support the model's choice over other plausible options. These scores turn out to be more closely tied to whether the model is right than earlier ways of judging explanations, across the A-OKVQA, VizWiz, and MMMU-Pro tasks. In a study where people had to say if a model prediction was correct without seeing the picture, showing the scores raised accuracy by 11.1 percent and lowered the rate of wrongly accepting bad predictions by 15.4 percent. The goal is to help users avoid over-trusting vision-language models when they must rely on text alone.

Core claim

The paper claims that Visual Fidelity and Contrastiveness scores provide a more reliable signal of whether a vision-language model prediction is correct than existing explanation quality measures, and that presenting these scores to users who cannot see the image improves their ability to detect incorrect predictions.
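One common way to quantify how well a score tracks correctness is Expected Calibration Error (ECE), which the Figure 8 caption also mentions. The sketch below is a minimal illustration, not the paper's evaluation code: the function name, the equal-width binning, and the assumption that quality scores lie in [0, 1] are all choices made here for the example.

```python
from typing import Sequence


def expected_calibration_error(
    scores: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE between scores in [0, 1] and binary correctness.

    Hypothetical helper for illustration; the paper's exact procedure
    is not given in this page.
    """
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and equal length")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores are assumed to lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for s, c in zip(scores, correct):
        # A score of exactly 1.0 goes into the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, c))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        # Weight each bin's |score - accuracy| gap by its share of the data.
        ece += (len(members) / total) * abs(mean_score - accuracy)
    return ece
```

A lower value means the score behaves more like a probability that the prediction is correct; a perfectly calibrated score gives 0.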

What carries the argument

Visual Fidelity and Contrastiveness scoring functions, which respectively measure how faithful an explanation is to the visual context and how distinctly it identifies features that separate the model's prediction from alternatives.
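The Figure 5 caption refers to a combined signal written as Prod(VF, Contr.). A minimal sketch of such a combination follows, assuming both scores are already available as numbers in [0, 1]; how the two scores themselves are computed is not specified on this page, and the function name is hypothetical.

```python
def combined_quality(visual_fidelity: float, contrastiveness: float) -> float:
    """Product of two quality scores, each assumed to lie in [0, 1].

    Illustrative only: mirrors the "Prod(VF, Contr.)" label, not a
    confirmed implementation from the paper.
    """
    for name, value in (
        ("visual_fidelity", visual_fidelity),
        ("contrastiveness", contrastiveness),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is assumed to lie in [0, 1], got {value}")
    # The product is high only when both qualities are high.
    return visual_fidelity * contrastiveness
```

Under this assumption, an explanation that is faithful but not contrastive (or the reverse) receives a low combined score.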

If this is right

The scores correlate more strongly with model correctness than prior explanation quality measures on A-OKVQA, VizWiz, and MMMU-Pro.
Users reach 11.1 percent higher accuracy when deciding if a VLM prediction is correct after seeing the scores.
The rate at which users accept incorrect VLM predictions drops by 15.4 percent when scores are displayed.
Displaying the scores promotes more appropriate reliance on VLM predictions for users without visual access.
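The user-study quantities above can be sketched as simple rates over paired judgments. This is an illustrative reading, not the paper's analysis code: over-reliance follows the page's description (accepting an incorrect prediction), while the under-reliance definition (rejecting a correct prediction) and all names are assumptions.

```python
from typing import Dict, Sequence


def reliance_metrics(
    model_correct: Sequence[bool],
    user_accepts: Sequence[bool],
) -> Dict[str, float]:
    """Compute user accuracy, over-reliance and under-reliance rates.

    model_correct[i]: whether the VLM prediction on item i was right.
    user_accepts[i]:  whether the user judged that prediction as right.
    """
    if len(model_correct) != len(user_accepts) or not model_correct:
        raise ValueError("inputs must be non-empty and equal length")

    pairs = list(zip(model_correct, user_accepts))
    accuracy = sum(m == u for m, u in pairs) / len(pairs)

    on_wrong = [u for m, u in pairs if not m]   # judgments on incorrect predictions
    on_right = [u for m, u in pairs if m]       # judgments on correct predictions

    # Over-reliance: share of incorrect predictions the user accepted.
    over = sum(on_wrong) / len(on_wrong) if on_wrong else 0.0
    # Under-reliance (assumed definition): share of correct predictions rejected.
    under = sum(not u for u in on_right) / len(on_right) if on_right else 0.0

    return {"accuracy": accuracy, "over_reliance": over, "under_reliance": under}
```

Comparing these rates between a condition with scores and one without is one way to express the reported changes.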

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Interface designers for image-description tools could embed these scores to let blind users decide when to trust or question an output.
Model developers might use the scores during training or evaluation to reward explanations that are both faithful and distinctive.
The same scoring approach could be tested on other multimodal models beyond the three tasks studied here to check broader applicability.

Load-bearing premise

The Visual Fidelity and Contrastiveness scoring functions can be computed reliably from VLM outputs and visual context, and the user study participants' behavior generalizes to target populations such as blind and low-vision users who rely on explanations alone.

What would settle it

A follow-up user study with blind and low-vision participants that finds no accuracy gain when quality scores are shown alongside explanations would falsify the practical benefit.

Figures

Figures reproduced from arXiv: 2509.25844 by Brihi Joshi, Jesse Thomason, Keyu He, Swabha Swayamdipta, Tejas Srinivasan, Xiang Ren.

**Figure 2.** Calibration curves for various quality scoring functions when evaluating explanations

**Figure 3.** Our study interface where users are shown

**Figure 4.** Effect of showing users different quality scores on User Accuracy, Over-Reliance and Under

**Figure 5.** Ablation on presentation types on A-OKVQA dataset: holding the numeric signal fixed (Prod(VF, Contr.), VF, or Contr.), we vary the on-screen label (shown as VF, shown as Contr, or simple confidence). User Accuracy (↑) and Over-Reliance (↓) are effectively unchanged by naming/framing.

**Figure 6.** Effect of showing numeric and descriptive qualities on User Accuracy and Over-Reliance.

**Figure 7.** Here, the study interface shows the partic

**Figure 8.** Relationship between ECE of different quality scores and their downstream utility to users.

**Figure 9.** Explanation quality messages for each instruction condition. Subfigure

Original abstract

When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA, VizWiz, and MMMU-Pro tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper defines two new scores for VLM explanations that improve user accuracy in judging model correctness without images, but the user study leaves the gains for actual blind and low-vision users untested.

The main takeaway is that Visual Fidelity and Contrastiveness scores, when shown with VLM explanations, lift participants' accuracy at judging whether a prediction is correct by 11.1% and cut false belief in wrong predictions by 15.4% on a no-image task. The scores also calibrate better to model correctness than prior explanation metrics across A-OKVQA, VizWiz, and MMMU-Pro. That is the concrete advance: a pair of functions that try to capture faithfulness to the image and ability to rule out alternatives, plus an initial human test of their effect on trust calibration. The focus on users who cannot see the visual context is a direct response to a documented problem with explanations alone convincing people of bad outputs. The benchmark results give some evidence that these scores track correctness more reliably than earlier measures.

The user study is the clearest limitation. It withholds images from participants but gives no details on who those participants are or whether they match the target blind and low-vision population. Sighted people who simply lack the image in the moment may use the scores differently than users whose only information has ever been textual. Without demographics, ablations, or a follow-up with the intended group, the reported effect sizes rest on an unverified assumption about generalization. The abstract is also light on the exact computation of the scores and on statistical tests or controls.

This work is for researchers in multimodal explanation and accessibility who need practical ways to reduce overreliance. A reader looking for new scoring ideas or empirical setups around calibrated trust would find usable pieces here. It is solid enough on the problem framing and the proposed functions to merit a serious referee, even with the open questions on the study design. I would send it to review.

Referee Report

3 major / 2 minor

Summary. The paper proposes two complementary quality scoring functions for VLM explanations—Visual Fidelity (measuring faithfulness to visual context) and Contrastiveness (measuring distinction from plausible alternatives)—to help users without visual access assess prediction reliability. These scores are evaluated on A-OKVQA, VizWiz, and MMMU-Pro, where they are reported to be better calibrated with model correctness than prior explanation qualities. A user study shows that displaying the scores alongside explanations improves participants' accuracy at judging VLM correctness by 11.1% and reduces false belief in incorrect predictions by 15.4% in a no-visual-context setting.

Significance. If the calibration and user-study results hold, the work provides a practical mechanism to reduce overreliance on VLMs for accessibility-critical users. The multi-task benchmark evaluation and controlled user study constitute concrete, falsifiable evidence that explanation quality scores can improve appropriate reliance; this is a strength relative to purely qualitative prior work on VLM explanations.

major comments (3)

[User Study] User Study section: the headline 11.1% accuracy gain and 15.4% false-belief reduction are measured with participants who are not described as belonging to the target population of blind or low-vision users. Because the study withholds images from (presumably sighted) participants, it is unclear whether the observed effect sizes would replicate for users whose only information is the explanation plus scores; this is load-bearing for the central claim of utility for the intended users.
[§3] Abstract and §3 (Quality Scoring Functions): the manuscript reports that the proposed scores are “better calibrated” than baselines, yet provides no explicit formulas, pseudocode, or parameter definitions for Visual Fidelity and Contrastiveness, nor any statistical tests (e.g., confidence intervals or p-values) for the 11.1% and 15.4% figures. Without these details the calibration and user-study claims cannot be fully verified or reproduced.
[Evaluation] Evaluation sections: no ablation or control is reported for potential confounds such as explanation length, lexical overlap with the prediction, or participant fatigue in the user study. These omissions weaken the attribution of the accuracy improvement specifically to the proposed quality scores.
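The length confound raised in the comments above could be probed with a simple correlation between explanation length and quality score. The sketch below is one hedged way to do that; the helper names and the whitespace word count are assumptions for illustration, and a low correlation alone would not rule out the confound.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(
    explanations: Sequence[str],
    quality_scores: Sequence[float],
) -> float:
    """Correlate explanation length (whitespace word count) with a quality score."""
    lengths = [float(len(text.split())) for text in explanations]
    return pearson(lengths, quality_scores)
```

A value near zero would suggest the score is not simply rewarding longer explanations; a partial correlation or regression would be needed to control for several factors at once.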

minor comments (2)

[Abstract] The abstract would be clearer if it briefly stated how the two scoring functions are computed from VLM outputs and visual context.
Ensure that any tables reporting calibration metrics include the exact number of instances per task and the baseline methods used for comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below, indicating where revisions will be made.

Referee: [User Study] User Study section: the headline 11.1% accuracy gain and 15.4% false-belief reduction are measured with participants who are not described as belonging to the target population of blind or low-vision users. Because the study withholds images from (presumably sighted) participants, it is unclear whether the observed effect sizes would replicate for users whose only information is the explanation plus scores; this is load-bearing for the central claim of utility for the intended users.

Authors: We recognize the value of testing with the target user population. Our study intentionally used a no-visual-context paradigm with sighted participants to control for variables and directly assess the impact of quality scores when visual information is unavailable, which is the core scenario for blind and low-vision users. We will revise the manuscript to explicitly discuss this as a limitation and to emphasize that the results provide evidence for the mechanism in the relevant information setting. Future studies with actual blind or low-vision participants are planned as follow-up work.

revision: partial

Referee: [§3] Abstract and §3 (Quality Scoring Functions): the manuscript reports that the proposed scores are “better calibrated” than baselines, yet provides no explicit formulas, pseudocode, or parameter definitions for Visual Fidelity and Contrastiveness, nor any statistical tests (e.g., confidence intervals or p-values) for the 11.1% and 15.4% figures. Without these details the calibration and user-study claims cannot be fully verified or reproduced.

Authors: We apologize for any ambiguity in the presentation. The full manuscript includes definitions in Section 3, but to improve reproducibility, we will add explicit mathematical formulas, pseudocode, and parameter details for Visual Fidelity and Contrastiveness in the revised version. We will also include statistical tests, such as confidence intervals and p-values, for the reported accuracy gains and false-belief reductions from the user study.

revision: yes

Referee: [Evaluation] Evaluation sections: no ablation or control is reported for potential confounds such as explanation length, lexical overlap with the prediction, or participant fatigue in the user study. These omissions weaken the attribution of the accuracy improvement specifically to the proposed quality scores.

Authors: We agree that ruling out confounds strengthens the conclusions. In the revised manuscript, we will incorporate additional analyses to control for explanation length and lexical overlap, for example by reporting correlations or partial correlations with these factors. Regarding participant fatigue, we will provide more details on the experimental design, including counterbalancing and session structure, and any post-hoc checks for order effects.

revision: yes
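The confidence intervals discussed in the exchange above could be obtained with a percentile bootstrap over per-judgment correctness. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the two conditions are treated as independent samples, and any data passed in is the caller's own. If the same participants see both conditions, a paired resampling scheme would be more appropriate.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_accuracy_gain(
    baseline: Sequence[bool],
    treated: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile-bootstrap interval for accuracy(treated) - accuracy(baseline).

    Each element is True when a user's judgment of model correctness was right.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not baseline or not treated:
        raise ValueError("both conditions need at least one judgment")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def accuracy(sample: Sequence[bool]) -> float:
        return sum(sample) / len(sample)

    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample each condition with replacement, independently.
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treated) for _ in treated]
        diffs.append(accuracy(t) - accuracy(b))
    diffs.sort()

    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return accuracy(treated) - accuracy(baseline), lower, upper
```

An interval that excludes zero would support the claim that the gain is not a sampling artifact, subject to the independence assumption noted above.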

Circularity Check

0 steps flagged

No significant circularity; scoring functions and user-study gains are independently defined and measured

The paper defines Visual Fidelity and Contrastiveness scoring functions directly from VLM outputs and visual context without fitting them to the evaluation outcomes. These scores are then assessed for calibration on held-out tasks (A-OKVQA, VizWiz, MMMU-Pro) and their utility is measured via a separate user study reporting an 11.1% accuracy improvement. No derivation step reduces the proposed scores or the reported gains to quantities that are fitted from or defined in terms of the same data; the chain remains self-contained with external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that explanation quality can be quantified independently of user perception and that benchmark tasks adequately proxy real-world use cases for blind users.

axioms (1)

domain assumption: Explanation quality can be decomposed into visual fidelity and contrastiveness dimensions that are measurable from model outputs and image context.
Invoked when proposing the two scoring functions as remedies for overreliance.
