VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering

Benteng Ma; Jianfei Cai; Jiayi Chen; Winston Chong; Yasmeen George; Zehui Liao

arxiv: 2605.20772 · v1 · pith:DKKWX5WJnew · submitted 2026-05-20 · 💻 cs.CV

Jiayi Chen, Benteng Ma, Zehui Liao, Winston Chong, Yasmeen George, Jianfei Cai

Pith reviewed 2026-05-21 05:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords: hallucination detection · medical VQA · multimodal large language models · visual token masking · semantic entropy · decoder layers · visual dependency probing

The pith

VIHD detects hallucinations in medical VQA by masking visual tokens in dominant decoder layers to produce a calibrated semantic entropy signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical multimodal models frequently generate answers that appear plausible yet lack grounding in the provided image, raising concerns for clinical reliability. The paper shows that locating decoder layers with strong visual influence and then masking the corresponding visual tokens during generation shifts the output distribution in a targeted way. This shift yields a calibrated semantic entropy value that more reliably indicates when an answer was not supported by visual evidence. Unlike prior approaches that apply broad perturbations or measure raw uncertainty, the intervention focuses on internal cross-modal dependencies at specific decoding steps. Results across three medical VQA benchmarks and two different models indicate higher detection accuracy than existing methods.
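A minimal sketch of what masking visual tokens inside a single attention step could look like. This is not the paper's implementation: the function names, the choice of setting masked logits to negative infinity before the softmax, and the toy scores are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, visual_positions, mask_visual=False):
    """Attention weights for one query token.

    scores: raw attention logits over the context positions.
    visual_positions: indices that hold visual tokens (hypothetical layout).
    mask_visual: if True, visual positions are set to -inf and get zero weight.
    """
    visual = set(visual_positions)
    if mask_visual:
        if all(i in visual for i in range(len(scores))):
            raise ValueError("at least one non-visual position is required")
        scores = [float("-inf") if i in visual else s
                  for i, s in enumerate(scores)]
    return softmax(scores)


# Toy example: positions 0-2 are visual tokens, positions 3-4 are text tokens.
logits = [2.0, 1.5, 1.0, 0.5, 0.2]
before = attention_weights(logits, visual_positions=[0, 1, 2])
after = attention_weights(logits, visual_positions=[0, 1, 2], mask_visual=True)
print("visual attention mass before masking:", sum(before[:3]))
print("visual attention mass after masking: ", sum(after[:3]))
```

In this toy setting the masked weights are renormalised over the text positions only, which is the sense in which the output distribution "shifts" once the visual evidence is withheld.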

Core claim

The paper establishes that Visual Dependency Probing can identify decoder layers where visual tokens exert dominant influence, and that applying Visual Intervention Decoding by masking those tokens calibrates the semantic distribution such that the resulting Calibrated Semantic Entropy serves as a more effective hallucination signal than uncalibrated uncertainty estimates or heuristic input perturbations.
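This page does not say how Visual Dependency Probing scores a layer. As a rough sketch only, suppose each decoder layer has a scalar "visual share" (for example, the fraction of attention mass that falls on visual tokens) and that a contiguous window of layers is selected, as the sliding-window figure caption suggests. The function name `dominant_window`, the scoring quantity, and the selection rule are hypothetical.

```python
def dominant_window(visual_share, width):
    """Return indices of the contiguous block of `width` layers whose mean
    visual share is highest (ties resolved in favour of the earliest block).

    visual_share: one hypothetical per-layer score, ordered by layer index.
    width: number of adjacent layers to select.
    """
    if not 1 <= width <= len(visual_share):
        raise ValueError("width must be between 1 and the number of layers")
    best_start, best_mean = 0, float("-inf")
    for start in range(len(visual_share) - width + 1):
        mean = sum(visual_share[start:start + width]) / width
        if mean > best_mean:
            best_start, best_mean = start, mean
    return list(range(best_start, best_start + width))


# Toy per-layer scores for a five-layer decoder (made-up numbers).
shares = [0.1, 0.2, 0.6, 0.7, 0.3]
print(dominant_window(shares, width=2))
```

Selecting a block rather than isolated layers keeps the intervention localised to one region of the decoder; whether the paper does this is not confirmed here.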

What carries the argument

Visual Dependency Probing to locate dominant layers, followed by targeted visual token masking in Visual Intervention Decoding to compute Calibrated Semantic Entropy as the hallucination indicator.
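For orientation, semantic entropy in the sense of Farquhar et al. groups several sampled answers into meaning clusters and takes the entropy of the cluster distribution. The sketch below computes that quantity from cluster labels that are assumed to be given already; the labels and sample sets are invented. How VIHD combines the original and intervened distributions into Calibrated Semantic Entropy is not specified on this page, so the sketch stops at the two plain entropies.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy in nats of the empirical distribution over semantic clusters.

    cluster_ids: one cluster label per sampled answer; answers with the same
    meaning are assumed to share a label.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    total = len(cluster_ids)
    probs = [count / total for count in Counter(cluster_ids).values()]
    return sum(-p * math.log(p) for p in probs)


# Hypothetical cluster labels for five samples under each decoding condition.
original_samples = ["pneumonia"] * 5
masked_samples = ["pneumonia", "pneumonia", "effusion", "normal", "effusion"]

print("entropy, original decoding:", semantic_entropy(original_samples))
print("entropy, masked decoding:  ", semantic_entropy(masked_samples))
```

A model that keeps giving the same answer scores zero, while answers that scatter across meanings score higher; the comparison between the two conditions is what the intervention is meant to expose.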

If this is right

The approach supplies a practical detection module that can be added to existing medical MLLMs without retraining.
It establishes that fine-grained, layer-specific visual dependencies during decoding carry information about response reliability.
Performance gains hold across different model architectures, suggesting the intervention targets a shared generation mechanism.
By quantifying the effect of removing visual input at critical steps, the method offers a direct measure of visual grounding strength.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar layer-wise probing could be applied to text-only components to map how linguistic dependencies interact with visual ones.
The intervention principle may extend to detecting other failure modes such as factual inconsistencies by masking different token types.
Strengthening connections in the identified dominant layers during training could reduce the frequency of hallucinations at the source.

Load-bearing premise

Masking visual tokens in the layers flagged by dependency probing isolates hallucination-related uncertainty rather than reflecting unrelated model behaviors or general entropy increases.

What would settle it

The claim would be undermined if, on a held-out medical VQA set, the calibrated entropy scores fail to separate verified hallucinated responses from grounded ones at a statistically significant level, or if the probed layers show no measurable increase in visual-token influence compared with other layers.
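One common way to quantify the separation described here is the area under the ROC curve, computed directly as the probability that a hallucinated case receives a higher score than a grounded one. The sketch below assumes binary labels (1 for hallucinated, 0 for grounded) and uses made-up scores; it measures separation only and does not perform a significance test.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    scores: detector outputs, higher meaning "more likely hallucinated".
    labels: 1 for hallucinated, 0 for grounded. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores for four held-out cases.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

A value near 0.5 would mean the score carries no information about hallucination, which is the failure case this section describes.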

Figures

Figures reproduced from arXiv: 2605.20772 by Benteng Ma, Jianfei Cai, Jiayi Chen, Winston Chong, Yasmeen George, Zehui Liao.

**Figure 2.** Variants of VDP (Random, Low, High). Y-axis: AUC (%). Legend entries as extracted: Open, All. Plotted values as extracted: 82.2, 77.64, 83.13, 70.92, 67.66, 78.23. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

**Figure 4.** Effect of sliding window width. Axis text as extracted: Masking ratio (%) with ticks 5, 10, 25, 50; AUC (%). Legend entries as extracted: Open, All. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Effect of masking ratio. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

Original abstract

While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination detection. VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal. Extensive experiments on three medical VQA benchmarks with two medical MLLMs demonstrate that VIHD consistently outperforms state-of-the-art methods, underscoring the importance of fine-grained visual dependency for hallucination detection. The code will be available at https://github.com/Jiayi-Chen-AU/VIHD

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIHD's targeted masking in visually dominant layers to produce calibrated semantic entropy is a reasonable step past generic uncertainty baselines, but the claim that it specifically flags visual hallucinations rests on unshown controls.

The main point for you is that this paper proposes a pipeline of visual dependency probing to pick decoder layers, followed by token masking during intervention decoding, to generate a calibrated semantic entropy score that they say detects hallucinations better than prior methods in medical VQA. The abstract frames this as addressing the limits of context-agnostic perturbations by looking inside the cross-modal attention during generation.

Referee Report

3 major / 2 minor

Summary. The paper proposes VIHD, a Visual Intervention-based Hallucination Detection method for medical Visual Question Answering with Multimodal Large Language Models. It introduces Visual Dependency Probing (VDP) to locate visually dominant decoder layers, Visual Intervention Decoding (VID) that masks visual tokens in those layers to calibrate the semantic distribution, and Calibrated Semantic Entropy (CSE) derived from the intervention as a hallucination signal. The central claim is that this targeted internal cross-modal intervention yields more effective detection than prior uncertainty estimation or verification baselines, with extensive experiments on three medical VQA benchmarks using two medical MLLMs showing consistent outperformance.

Significance. If the core claim holds, the work offers a concrete advance in hallucination detection for safety-critical medical MLLMs by shifting from heuristic external perturbations to fine-grained internal visual-token interventions. Releasing code would strengthen reproducibility; the multi-benchmark, multi-model evaluation setup is a positive step toward practical clinical utility.

major comments (3)

[§3.2] Visual Dependency Probing: The description does not include direct evidence or ablations showing that layers selected by VDP produce entropy shifts that are larger precisely for hallucinated answers (lacking visual support) versus non-hallucinated but uncertain responses. Without such controls, it remains unclear whether VDP isolates hallucination-specific visual grounding or merely captures generic attention or prediction difficulty patterns.
[§4] Experiments: The claim of consistent outperformance over state-of-the-art methods is load-bearing for the contribution, yet the manuscript provides insufficient detail on exact evaluation metrics, baseline re-implementations, statistical significance tests, or controls for confounding effects introduced by the masking operation itself. This weakens the data-to-claim link.
[§3.3] Calibrated Semantic Entropy: The derivation of CSE from the VID-masked distribution is presented as a calibrated hallucination signal, but no analysis demonstrates that the entropy change is specifically discriminative for visual hallucinations rather than overall model uncertainty (e.g., ambiguous phrasing or low-confidence tokens unrelated to vision).

minor comments (2)

[§3] Notation for the entropy quantities and masking operation could be made more explicit, with a clear equation linking the pre- and post-intervention distributions.
Figure captions and axis labels in the experimental results should explicitly state the metrics and models used to improve readability.
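
As an illustration of the kind of notation the first minor comment asks for, the standard semantic entropy over meaning clusters can be written once for each decoding condition. The symbols below are assumptions for this sketch: x is the image, q the question, \mathcal{C} the set of meaning clusters, p the cluster distribution under ordinary decoding, and \tilde{p} the distribution under visual intervention. The paper's own definition of CSE, and how it combines the two quantities, is not given on this page.

```latex
H(x, q) = -\sum_{c \in \mathcal{C}} p(c \mid x, q)\,\log p(c \mid x, q),
\qquad
\tilde{H}(x, q) = -\sum_{c \in \mathcal{C}} \tilde{p}(c \mid x, q)\,\log \tilde{p}(c \mid x, q)
```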

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. These observations have helped us strengthen the presentation of our method and experiments. We provide point-by-point responses below and have revised the manuscript to incorporate additional evidence and details where appropriate.

Referee: [§3.2] Visual Dependency Probing: The description does not include direct evidence or ablations showing that layers selected by VDP produce entropy shifts that are larger precisely for hallucinated answers (lacking visual support) versus non-hallucinated but uncertain responses. Without such controls, it remains unclear whether VDP isolates hallucination-specific visual grounding or merely captures generic attention or prediction difficulty patterns.

Authors: We appreciate this observation. While the overall performance gains on hallucination detection tasks provide indirect support for the utility of VDP-selected layers, we agree that direct controls comparing entropy shifts on hallucinated versus non-hallucinated but uncertain responses would strengthen the claim that VDP isolates visual grounding. In the revised manuscript we have added an ablation study in Section 3.2 that reports entropy shift magnitudes for both categories across the probed layers, along with attention visualizations confirming higher visual-token dependency in the selected layers for hallucinated cases. These results indicate that the entropy changes are more pronounced when visual support is absent.

Revision: yes

Referee: [§4] Experiments: The claim of consistent outperformance over state-of-the-art methods is load-bearing for the contribution, yet the manuscript provides insufficient detail on exact evaluation metrics, baseline re-implementations, statistical significance tests, or controls for confounding effects introduced by the masking operation itself. This weakens the data-to-claim link.

Authors: We agree that greater experimental transparency is required. The revised Section 4 now specifies all evaluation metrics (AUC-ROC, F1, and precision-recall), provides exact re-implementation details for each baseline including hyper-parameters and random seeds, reports statistical significance via paired t-tests with p-values, and includes controls for the masking operation by comparing VID against random visual-token masking and non-visual interventions. These additions clarify that the observed gains are attributable to the targeted visual intervention rather than generic masking effects.

Revision: yes

Referee: [§3.3] Calibrated Semantic Entropy: The derivation of CSE from the VID-masked distribution is presented as a calibrated hallucination signal, but no analysis demonstrates that the entropy change is specifically discriminative for visual hallucinations rather than overall model uncertainty (e.g., ambiguous phrasing or low-confidence tokens unrelated to vision).

Authors: This is a fair point. The current derivation relies on the assumption that visual-token masking primarily affects visually grounded predictions, but we lacked explicit discrimination against non-visual uncertainty sources. In the revision we have added an analysis in Section 3.3 that partitions test cases into visual hallucinations, linguistic ambiguity, and low-confidence non-visual tokens, then compares the resulting CSE values. The analysis shows larger calibrated entropy shifts for visual hallucinations, supporting specificity to cross-modal grounding. Corresponding figures and discussion have been included.

Revision: yes
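
The rebuttal mentions paired t-tests without saying what is paired. As a hedged sketch, assume each method is run over the same set of seeds or benchmark splits and produces one AUC per run; the paired t statistic is then the mean difference divided by its standard error. The function name and the numbers are invented, and the p-value (which needs a t distribution with n - 1 degrees of freedom) is deliberately left out.

```python
import math
import statistics


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic for matched runs of two methods.

    Both lists must align run by run (same seed or split at each index).
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    if len(method_scores) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    spread = statistics.stdev(diffs)  # sample standard deviation
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Made-up AUCs over four matched runs.
targeted = [0.82, 0.84, 0.83, 0.86]
random_mask = [0.78, 0.79, 0.80, 0.80]
print(paired_t_statistic(targeted, random_mask))
```

Pairing matters because run-to-run variation that affects both methods equally cancels in the differences, which is why the same seeds or splits must be used for both.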

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces VIHD as a new pipeline consisting of Visual Dependency Probing (VDP) to identify visually dominant decoder layers, Visual Intervention Decoding (VID) via targeted token masking, and Calibrated Semantic Entropy (CSE) computed from the resulting distribution shift. These components are defined procedurally from first principles of cross-modal attention and entropy measurement rather than reducing the final hallucination signal or performance gain to a fitted parameter, self-referential definition, or prior self-citation by construction. The central claim of outperformance is supported by empirical results on three external medical VQA benchmarks with two MLLMs, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the method is described procedurally at a high level without equations or implementation details that would reveal fitted values or new postulates.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear. The paper passage:

VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 3 internal anchors

[1] Ben Abacha, A., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the medical visual question answering task at ImageCLEF
[2] In: CLEF. vol. 2380 (2019)
[3] Chen, J., Yang, D., Wu, T., Jiang, Y., Hou, X., Li, M., Wang, S., Xiao, D., Li, K., Zhang, L.: Detecting and evaluating medical hallucinations in large vision language models. arXiv preprint arXiv:2406.10185 (2024)
[4] Cohen, R., Hamri, M., Geva, M., Globerson, A.: LM vs LM: Detecting factual errors via cross examination. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 12621–12640 (2023)
[5] Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: ACL. pp. 889–898 (2018)
[6] Farquhar, S., Kossen, J., Kuhn, L., Gal, Y.: Detecting hallucinations in large language models using semantic entropy. Nature 630(8017), 625–630 (2024)
[7] Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. In: AAAI. vol. 38, pp. 18135–18143 (2024)
[8] He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: Decoding-enhanced BERT with disentangled attention. In: ICLR (2021)
[9] Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y.: The curious case of neural text degeneration. In: ICLR (2020)
[10] Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., Zhao, P.: Self-introspective decoding: Alleviating hallucinations for large vision-language models. In: ICLR (2024)
[11] Jiang, S., Wang, Y., Song, S., Hu, T., Zhou, C., Pu, B., Zhang, Y., Yang, Z., Feng, Y., Zhou, J.T., et al.: Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668 (2025)
[12] Jin, M., Liao, Z., Xia, Y.: V-Loop: Visual logical loop verification for hallucination detection in medical visual question answering. arXiv preprint arXiv:2601.18240 (2026)
[13] Kuhn, L., Gal, Y., Farquhar, S.: Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In: ICLR (2023)
[14] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 1–10 (2018)
[15] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: CVPR. pp. 13872–13882 (2024)
[16] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. NeurIPS 36, 28541–28564 (2023)
[17] Li, Q., Geng, J., Lyu, C., Zhu, D., Panov, M., Karray, F.: Reference-free hallucination detection for large vision-language models. In: EMNLP. pp. 4542–4551 (2024)
[18] Liao, Z., Hu, S., Zou, K., Fu, H., Zhen, L., Xia, Y.: Vision-amplified semantic entropy for hallucination detection in medical visual question answering. In: MICCAI. pp. 669–679. Springer (2025)
[19] Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: ISBI. pp. 1650–1654. IEEE (2021)
[20] Ma, H., Chen, J., Zhou, J.T., Wang, G., Zhang, C.: Estimating LLM uncertainty with evidence. arXiv preprint arXiv:2502.00290 (2025)
[21] Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Md, A.E.M., Moseley, M., Langlotz, C., Chaudhari, A.S., et al.: GREEN: Generative radiology report evaluation and error notation. In: EMNLP. pp. 374–390 (2024)
[22] Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: MedGemma technical report. arXiv preprint arXiv:2507.05201 (2025)
[23] Sun, Z., Zang, X., Zheng, K., Xu, J., Zhang, X., Yu, W., Song, Y., Li, H.: ReDeEP: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability. In: ICLR (2025)
[24] Wu, J., Liu, Q., Wang, D., Zhang, J., Wu, S., Wang, L., Tan, T.: Logical closed loop: Uncovering object hallucinations in large vision-language models. In: ACL. pp. 6944–6962 (2024)
[25] Xiao, W., Huang, Z., Gan, L., He, W., Li, H., Yu, Z., Shu, F., Jiang, H., Zhu, L.: Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback. In: AAAI. vol. 39, pp. 25543–25551 (2025)
[26] Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
[27] Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: MM-Vet: Evaluating large multimodal models for integrated capabilities. In: ICML. pp. 57730–57754. PMLR (2024)
[28] Zhang, J., Li, Z., Das, K., Malin, B., Kumar, S.: SAC3: reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. In: EMNLP. pp. 15445–15458 (2023)
[29] Zhang, R., Zhang, H., Zheng, Z.: VL-Uncertainty: Detecting hallucination in large vision-language model via uncertainty estimation. arXiv preprint arXiv:2411.11919 (2024)
[30] Zhang, S., Sambara, S., Banerjee, O., Acosta, J., Fahrner, L.J., Rajpurkar, P.: RadFlag: A black-box hallucination detection method for medical vision language models. arXiv preprint arXiv:2411.00299 (2024)
