arxiv: 2511.01458 · v2 · submitted 2025-11-03 · 💻 cs.CV · cs.AI

When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA

Luca Carlini, Dennis Pierantozzi, Mauro Orazio Drago, Chiara Lena, Cesare Hassan, Elena De Momi, Danail Stoyanov, Sophia Bano

Mobarak I. Hoque

Pith reviewed 2026-05-18 01:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords surgical VQA · uncertainty estimation · semantic entropy · question alignment · visual question answering · medical AI safety · EndoVis18-VQA · zero-shot models

The pith

Question-aligned semantic entropy improves detection of trustworthy answers in surgical VQA by weighting responses for question relevance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces QA-SNNE to address a gap in uncertainty estimation for surgical visual question answering. Standard semantic nearest neighbor entropy ignores the specific question, so it can overstate confidence for answers that are internally consistent but off-topic for the clinical query. QA-SNNE adds bilateral gating that scales pairwise answer similarities by their alignment with the question, using embedding, entailment, or cross-encoder scores. When tested on EndoVis18-VQA, this raises AUROC for zero-shot models and holds some gains when questions are rephrased while images and ground-truth answers stay fixed.

Core claim

QA-SNNE is a black-box uncertainty estimator that incorporates question-answer alignment into semantic entropy through bilateral gating. It measures uncertainty by weighting pairwise semantic similarities among sampled answers according to their relevance to the question, using embedding-based, entailment-based, or cross-encoder alignment strategies. Evaluation on five VQA models across two surgical datasets shows AUROC gains on EndoVis18-VQA for two of three zero-shot models with in-template questions, and up to +8% AUROC under out-of-template rephrasing.

What carries the argument

Question-Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a bilateral gating mechanism that multiplies semantic similarity scores by question-answer alignment scores before entropy calculation.
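A minimal sketch of how such gating could work, under stated assumptions. The summary above only says that similarity scores are multiplied by question-answer alignment scores before the entropy calculation. The LogSumExp form of the entropy, the product gate align[i] * align[j], the temperature tau, and the function names snne and qa_snne are all illustrative choices, not the paper's confirmed formulation.

```python
import math

def snne(sim, tau=1.0):
    """Assumed SNNE-style score: -(1/n) * sum_i log sum_j exp(sim[i][j] / tau).

    sim is an n x n matrix of pairwise semantic similarities between
    sampled answers. A higher return value means more uncertainty.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        total += math.log(sum(math.exp(sim[i][j] / tau) for j in range(n)))
    return -total / n

def qa_snne(sim, align, tau=1.0):
    """Hypothetical bilateral gate: scale sim[i][j] by the question
    alignment of BOTH answers (align[i] * align[j]) before the entropy.

    align[i] is assumed to lie in [0, 1]; it could come from an embedding,
    entailment, or cross-encoder scorer, none of which is implemented here.
    """
    n = len(sim)
    gated = [[align[i] * align[j] * sim[i][j] for j in range(n)]
             for i in range(n)]
    return snne(gated, tau)
```

Under these assumptions, three mutually identical answers that are poorly aligned with the question (for example align = [0.1, 0.1, 0.1]) receive a higher uncertainty score than the same answers with full alignment, which is the behavior the gating is meant to produce.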

If this is right

QA-SNNE supplies a model-agnostic safeguard that links semantic uncertainty directly to question relevance in surgical VQA.
Gains appear in zero-shot settings and persist under controlled question rephrasing on EndoVis18-VQA.
The estimator remains practical because it requires only sampled answers and no model internals.
Results are mixed on external validation, indicating dataset-specific limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

QA-SNNE could be tested on live operating-room logs to measure whether uncertainty flags match actual surgeon overrides.
Combining the method with parameter-efficient fine-tuning might stabilize performance when both model weights and question phrasing vary.
The bilateral gating idea suggests similar alignment steps could improve uncertainty estimates in other medical question-answering tasks where intent shifts with wording.

Load-bearing premise

The out-of-template rephrased dataset, created by modifying only question wording while keeping images and ground-truth answers unchanged, sufficiently captures real clinical variation in question phrasing that would affect model behavior and uncertainty estimation.
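The invariant behind this construction can be sketched as a small data structure. This is an illustration only: the VQASample type, its field names, the rephrase helper, and the example strings are hypothetical and are not taken from the paper or from EndoVis18-VQA.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VQASample:
    image_id: str   # identifier of the surgical frame (hypothetical field)
    question: str   # question text, the only field allowed to change
    answer: str     # ground-truth answer, held fixed

def rephrase(sample: VQASample, new_question: str) -> VQASample:
    """Return an out-of-template variant: same image and answer, new wording."""
    return replace(sample, question=new_question)

# Made-up example strings, for illustration only.
base = VQASample("frame_0001", "What is the state of the tool?", "idle")
variant = rephrase(base, "Is the instrument currently doing anything?")
```

Because the sample is frozen and only the question is replaced, any change in model output or uncertainty between base and variant can be attributed to wording alone.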

What would settle it

A direct comparison of QA-SNNE uncertainty scores against clinician judgments on live surgical cases where the same image receives both the benchmark question and a naturally varied clinical phrasing, checking whether lower uncertainty still tracks with correct answers.

Original abstract

Safety and reliability are critical for deploying visual question answering (VQA) systems in surgery, where incorrect or ambiguous responses can cause patient harm. A key limitation of existing uncertainty estimation methods, such as Semantic Nearest Neighbor Entropy (SNNE), is that they do not explicitly account for the conditioning question. As a result, they may assign high confidence to answers that are semantically consistent yet misaligned with the clinical question, especially under variation in question phrasing. We propose Question-Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a black-box uncertainty estimator that incorporates question-answer alignment into semantic entropy through bilateral gating. QA-SNNE measures uncertainty by weighting pairwise semantic similarities among sampled answers according to their relevance to the question, using embedding-based, entailment-based, or cross-encoder alignment strategies. To assess robustness to language variation, we construct an out-of-template rephrased version of a benchmark surgical VQA dataset, where only the question wording is modified while images and ground-truth answers remain unchanged. We evaluate QA-SNNE on five VQA models across two benchmark surgical VQA datasets in both zero-shot and parameter-efficient fine-tuned (PEFT) settings, including out-of-template questions. QA-SNNE improves AUROC on EndoVis18-VQA for two of three zero-shot models in-template (e.g., +15% for Llama3.2 and +21% for Qwen2.5) and achieves up to +8% AUROC improvement under out-of-template rephrasing, with mixed results on external validation. Overall, QA-SNNE provides a practical, model-agnostic safeguard for surgical VQA by linking semantic uncertainty to question relevance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Question-Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a black-box uncertainty estimator for surgical VQA that extends Semantic Nearest Neighbor Entropy by incorporating explicit question-answer alignment via bilateral gating. Alignment is computed using embedding-based, entailment-based, or cross-encoder strategies to weight pairwise semantic similarities among sampled answers. The method is evaluated on five VQA models (including Llama3.2 and Qwen2.5) across EndoVis18-VQA and a second benchmark dataset, in both zero-shot and PEFT settings. To test robustness to language variation, the authors construct an out-of-template rephrased version of EndoVis18-VQA by modifying only question wording while fixing images and ground-truth answers. Reported results include AUROC gains of +15% for Llama3.2 and +21% for Qwen2.5 on in-template zero-shot cases, up to +8% under out-of-template rephrasing, and mixed external validation outcomes.

Significance. If the AUROC improvements prove statistically robust and the out-of-template rephrasings adequately proxy real clinical question variation, QA-SNNE would provide a practical, model-agnostic safeguard that links semantic uncertainty estimation to question relevance, addressing a clear safety gap in surgical VQA deployment. The black-box nature and use of off-the-shelf alignment components are strengths that could facilitate adoption.

major comments (2)

The robustness claim for QA-SNNE under language variation (up to +8% AUROC) rests on the out-of-template rephrased dataset. This dataset is created by rewording questions while keeping images and ground-truth answers fixed; the manuscript should demonstrate that these synthetic changes produce answer distributions and uncertainty shifts comparable to authentic clinical rephrasings, rather than assuming they do. Without such validation or qualitative examples of model behavior changes, the practical safety benefit is not yet load-bearing.
Abstract and results: the reported AUROC gains (e.g., +15% for Llama3.2, +21% for Qwen2.5) are presented without error bars, statistical significance tests, or the exact number of samples used for each setting. This information is required to assess whether the improvements are reliable rather than within noise, especially given the mixed external validation results.
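One way to attach an interval to an AUROC figure is sketched below. It assumes the uncertainty score is used to predict whether an answer is incorrect (label 1 = incorrect). Percentile bootstrap resampling is an illustrative choice here, not a procedure the paper is known to use, and the function names auroc and bootstrap_ci are hypothetical.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count as 0.5). labels: 1 = incorrect answer, 0 = correct."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC over resampled questions."""
    auroc(scores, labels)  # fail early if a class is missing
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

With scores [0.1, 0.4, 0.35, 0.8] and labels [0, 0, 1, 1], three of the four positive-negative pairs are ranked correctly, so auroc returns 0.75. An interval that overlaps for two estimators would suggest the reported gain may be within noise.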

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to incorporate additional analysis and statistical reporting where feasible.

Referee: The robustness claim for QA-SNNE under language variation (up to +8% AUROC) rests on the out-of-template rephrased dataset. This dataset is created by rewording questions while keeping images and ground-truth answers fixed; the manuscript should demonstrate that these synthetic changes produce answer distributions and uncertainty shifts comparable to authentic clinical rephrasings, rather than assuming they do. Without such validation or qualitative examples of model behavior changes, the practical safety benefit is not yet load-bearing.

Authors: We agree that further substantiation of the rephrased dataset would strengthen the robustness claims. In the revised manuscript we have added qualitative examples of rephrased questions together with the corresponding shifts in sampled answer distributions and uncertainty estimates for representative cases. We have also expanded the discussion to explain the design rationale: by holding images and ground-truth answers fixed, the construction isolates the effect of linguistic variation on model behavior in a controlled manner. While a direct side-by-side comparison with a corpus of authentic clinical rephrasings would be ideal, such data are not publicly available and would require substantial new collection; we therefore present the current proxy as a first step toward evaluating sensitivity to question wording rather than a complete surrogate for all clinical variation.

revision: yes

Referee: Abstract and results: the reported AUROC gains (e.g., +15% for Llama3.2, +21% for Qwen2.5) are presented without error bars, statistical significance tests, or the exact number of samples used for each setting. This information is required to assess whether the improvements are reliable rather than within noise, especially given the mixed external validation results.

Authors: We appreciate this observation. The revised manuscript now includes error bars (standard deviation across three independent sampling runs), results of paired statistical significance tests (Wilcoxon signed-rank) for the reported AUROC differences, and the precise number of questions evaluated in each zero-shot, PEFT, in-template, and out-of-template setting. These details have been added to the abstract, results section, and tables.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in QA-SNNE proposal or evaluation chain

The paper defines QA-SNNE explicitly as an extension of existing SNNE via bilateral gating and alignment modules (embedding, entailment, or cross-encoder), then evaluates the resulting estimator on standard benchmarks plus a synthetically rephrased out-of-template set. No equation or step reduces the reported AUROC gains to a fitted parameter renamed as prediction, nor does any load-bearing claim rest on a self-citation whose content is itself unverified or defined by the present work. The out-of-template construction is an independent experimental choice whose validity can be assessed externally; it does not create a self-definitional loop. The derivation therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; method appears to introduce alignment scoring and bilateral gating whose exact parameters or thresholds are not specified here. No explicit free parameters, axioms, or invented entities are named.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
cs.CV 2025-11 conditional novelty 5.0

SurgViVQA adds temporal video encoding to surgical VideoQA and reports 9-11% gains in keyword accuracy over image-only baselines on two datasets plus improved robustness to question rephrasing.
