Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification
Pith reviewed 2026-06-26 21:56 UTC · model grok-4.3
The pith
Counter-evidence verification detects and corrects hallucinations in medical vision-language models by checking each statement against its supporting image region.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoEV is a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. It performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement into a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining.
What carries the argument
Bidirectional verification between textual assertions and their corresponding visual evidence regions, organized into a four-quadrant map that classifies statements by factuality and grounding.
If this is right
- Hallucination detection improves average PR-AUC by 3.0 percent and ROC-AUC by 3.9 percent across four medical datasets.
- Detection gains reach up to 18.5 percent in specific medical VQA scenarios.
- Hallucination correction improves Micro-F1 by up to 12.5 percent.
- Hallucination rates on medical report generation drop by more than 11.9 percent.
- Medical VQA accuracy increases as a result of the corrections.
Where Pith is reading between the lines
- If region extraction proves reliable, the same verification step could be inserted into pipelines for non-medical image captioning to reduce unsupported claims.
- The quadrant map could serve as a visual aid that lets a clinician quickly flag which sentences in an AI-generated report lack image support.
- The approach might lower the cost of deploying VLMs in medicine by avoiding the need for domain-specific retraining on every new dataset.
- Integrating the verification signal back into model decoding could test whether hallucinations can be prevented at generation time rather than corrected afterward.
Load-bearing premise
It is possible to reliably identify and extract the specific visual evidence region corresponding to each textual statement so that the bidirectional check can accurately test support.
What would settle it
A controlled test in which the extracted visual regions are deliberately replaced with unrelated image patches and the method's detection metrics fall to near-random levels.
Figures
read the original abstract
Vision-Language models (VLMs) reliability in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Co}unter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement into a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs.For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0% and 3.9% absolute points respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Counter-Evidence Verification (CoEV), a training-free, plug-and-play framework for detecting and correcting hallucinations in medical vision-language models (VLMs). CoEV conducts bidirectional verification between textual assertions and their corresponding visual evidence regions to assign statements to a four-quadrant map based on combinations of text factuality and visual grounding. Experiments across four medical datasets report consistent outperformance in hallucination detection (average +3.0% PR-AUC, +3.9% ROC-AUC) and correction (up to +12.5% Micro-F1, -11.9% hallucination rate).
Significance. If the central claims hold, the work provides a practical, retraining-free approach to mitigating hallucinations in medical VLMs, which could improve trustworthiness in clinical applications. The method's emphasis on evidence-based verification addresses a noted gap in existing attention-based or inconsistency-focused approaches.
major comments (1)
- [Framework description (bidirectional verification and four-quadrant map)] The four-quadrant classification depends on accurate identification and extraction of the specific visual evidence region for each textual statement. The manuscript does not report any validation, ablation, or error analysis of this region localization step, which is load-bearing for the detection performance claims, especially given the challenges of small, overlapping, or low-contrast anatomy in medical images.
minor comments (2)
- [Abstract] Formatting error in abstract: 'Co}unter-Evidence' should read 'Counter-Evidence'.
- [Abstract] Missing space in abstract: 'VLMs.For hallucination detection' should read 'VLMs. For hallucination detection'.
Simulated Author's Rebuttal
We thank the referee for the constructive comment. We address it directly below and will revise the manuscript to incorporate additional analysis.
read point-by-point responses
-
Referee: [Framework description (bidirectional verification and four-quadrant map)] The four-quadrant classification depends on accurate identification and extraction of the specific visual evidence region for each textual statement. The manuscript does not report any validation, ablation, or error analysis of this region localization step, which is load-bearing for the detection performance claims, especially given the challenges of small, overlapping, or low-contrast anatomy in medical images.
Authors: We agree that the region localization step is central to the four-quadrant map and that the manuscript does not report dedicated validation, ablation, or error analysis of this component. The bidirectional verification relies on accurate mapping of statements to evidence regions, and we will add an ablation study evaluating localization accuracy (where ground-truth regions are available) along with error analysis focused on small, overlapping, or low-contrast medical structures. This will be included in the revised manuscript to better substantiate the detection claims. revision: yes
Circularity Check
No circularity: training-free framework with no fitted parameters or self-referential derivations
full rationale
The paper describes CoEV as a training-free, plug-and-play method that performs bidirectional verification between textual assertions and visual evidence regions to populate a four-quadrant map. No equations, parameter fitting, or self-citations are presented that would reduce the reported AUC or F1 gains to the input data by construction. The central claims rest on external empirical evaluation across four datasets rather than any definitional loop or renamed input. The extraction of evidence regions is an implementation assumption, not a self-defining step.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Applied Sciences12(8), 3846 (2022)
An, J., Joe, I.: Attention map-guided visual explanations for deep neural networks. Applied Sciences12(8), 3846 (2022)
2022
-
[2]
Bae, S., Kyung, D., Ryu, J., Cho, E., Lee, G., Kweon, S., Oh, J., Ji, L., Chang, E., Kim, T., et al.: Mimic-ext-mimic-cxr-vqa: A complex, diverse, and large-scale visual question answering dataset for chest x-ray images (2024)
2024
-
[3]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.129661(2), 3 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Maira-2: Grounded radiology report generation.arXiv preprint arXiv:2406.04449, 2024
Bannur, S., Bouzid, K., Castro, D.C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., et al.: Maira-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449 (2024)
-
[5]
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Chen, J., Yang, D., Wu, T., Jiang, Y., Hou, X., Li, M., Wang, S., Xiao, D., Li, K., Zhang, L.: Detecting and evaluating medical hallucinations in large vision language models. arXiv preprint arXiv:2406.10185 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Journal of the American Medical Informatics Association23(2), 304–310 (2015)
Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiol- ogy examinations for distribution and retrieval. Journal of the American Medical Informatics Association23(2), 304–310 (2015)
2015
-
[7]
Nature630(8017), 625–630 (2024)
Farquhar, S., Kossen, J., Kuhn, L., Gal, Y.: Detecting hallucinations in large lan- guage models using semantic entropy. Nature630(8017), 625–630 (2024)
2024
-
[8]
Advanced Intelligent Systems p
Gu, Z., Chen, J., Liu, F., Yin, C., Zhang, P.: Medvh: Toward systematic evaluation of hallucination for large vision language models in the medical context. Advanced Intelligent Systems p. 2500255 (2025)
2025
-
[9]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 18135–18143 (2024)
2024
-
[10]
Scientific data6(1), 317 (2019)
Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data6(1), 317 (2019)
2019
-
[11]
Scientific data 5(1), 1–10 (2018)
Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5(1), 1–10 (2018)
2018
-
[12]
Advances in Neural Information Processing Systems36, 28541–28564 (2023) 10 Nan Zhou et al., submission to MICCAI 2026 review
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023) 10 Nan Zhou et al., submission to MICCAI 2026 review
2023
-
[13]
In: Inter- national Conference on Medical Image Computing and Computer-Assisted Inter- vention
Liao, Z., Hu, S., Zou, K., Fu, H., Zhen, L., Xia, Y.: Vision-amplified semantic entropy for hallucination detection in medical visual question answering. In: Inter- national Conference on Medical Image Computing and Computer-Assisted Inter- vention. pp. 669–679. Springer (2025)
2025
-
[14]
In: Text sum- marization branches out
Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text sum- marization branches out. pp. 74–81 (2004)
2004
-
[15]
In: Proceedings of the IEEE/CVF International Confer- ence on Computer Vision
Liu,B.,Zou,K.,Zhan,L.M.,Lu,Z.,Dong,X.,Chen,Y.,Xie,C.,Cao,J.,Wu,X.M., Fu, H.: Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis. In: Proceedings of the IEEE/CVF International Confer- ence on Computer Vision. pp. 21310–21320 (2025)
2025
-
[16]
A Survey on Hallucination in Large Vision-Language Models
Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
arXiv preprint arXiv:2509.04492 (2025)
Moslonka, C., Randrianarivo, H., Garnier, A., Malherbe, E.: Learned hallucina- tion detection in black-box llms using token-level entropy production rate. arXiv preprint arXiv:2509.04492 (2025)
-
[18]
In: Findings of the association for computational linguistics: EMNLP 2024
Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Md, A.E.M., Moseley, M., Langlotz, C., Chaudhari, A.S., et al.: Green: Generative radiology report evaluation and error notation. In: Findings of the association for computational linguistics: EMNLP 2024. pp. 374–390 (2024)
2024
-
[19]
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
2002
-
[20]
Multimedia Tools and Applications83(19), 57551–57578 (2024)
Raghavan, K., B, S., v, K.: Attention guided grad-cam: an improved explainable artificial intelligence model for infrared breast cancer detection. Multimedia Tools and Applications83(19), 57551–57578 (2024)
2024
-
[21]
Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
arXiv preprint arXiv:2306.07971 (2023)
Thawkar, O., Shaker, A., Mullappilly, S.S., Cholakkal, H., Anwer, R.M., Khan, S., Laaksonen, J., Khan, F.S.: Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971 (2023)
-
[23]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Xiao, W., Huang, Z., Gan, L., He, W., Li, H., Yu, Z., Shu, F., Jiang, H., Zhu, L.: Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 25543–25551 (2025)
2025
-
[24]
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu,C.,Li,Z.,etal.:Lingshu:Ageneralistfoundationmodelforunifiedmultimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
arXiv preprint arXiv:2411.00299 (2024)
Zhang, S., Sambara, S., Banerjee, O., Acosta, J., Fahrner, L.J., Rajpurkar, P.: Radflag: A black-box hallucination detection method for medical vision language models. arXiv preprint arXiv:2411.00299 (2024)
-
[26]
Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv:2303.00915 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
arXiv e-prints pp
Zou, K., Bai, Y., Chen, Z., Zhou, Y., Chen, Y., Ren, K., Wang, M., Yuan, X., Shen, X., Fu, H.: Medrg: Medical report grounding with multi-modal large language model. arXiv e-prints pp. arXiv–2404 (2024)
2024
-
[29]
Zou, K., Bai, Y., Liu, B., Chen, Y., Chen, Z., Zhou, Y., Yuan, X., Wang, M., Shen, X., Cao, X., et al.: Uncertainty-aware medical diagnostic phrase identification and grounding.IEEETransactionsonPatternAnalysisandMachineIntelligence(2025)
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.