Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models
Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3
The pith
Med-StepBench decomposes clinical reasoning into four diagnostic stages to detect hallucinations step by step in medical vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Med-StepBench is the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT. It comprises over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data. It decomposes clinical reasoning into four expert-designed diagnostic stages and uses clinician-verified annotations to perform the first step-level evaluation of general-purpose and medical VLMs. The evaluation reveals systematic failure modes that aggregate accuracy metrics obscure and shows that current VLMs remain highly susceptible to adversarial yet clinically plausible intermediate explanations that amplify hallucinations despite contradictory visual cues.
What carries the argument
The four expert-designed diagnostic stages that decompose clinical reasoning into successive steps for localization, abnormality identification, and related inferences in PET/CT data.
If this is right
- Aggregate accuracy metrics conceal critical reasoning errors that only appear at individual diagnostic stages.
- Vision-language models become significantly more prone to hallucinations when supplied with plausible but incorrect intermediate explanations.
- Step-level evaluation is required to identify where models fail to ground multi-step clinical reasoning in visual evidence.
- The benchmark supplies a concrete testbed for training and evaluating safer medical vision-language models that handle intermediate steps reliably.
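To make the first implication concrete, here is a minimal Python sketch of how a single aggregate accuracy can hide a weak stage. The stage names are those the page lists for the benchmark; the record format `(stage, predicted, truth)`, the function name `stage_accuracy`, and all numbers are invented for illustration and are not results from the paper.

```python
from collections import defaultdict

# Stage names as listed for the benchmark; everything else here is a toy assumption.
STAGES = [
    "Anatomical Mapping",
    "Lesion Identification",
    "Feature Characterization",
    "Diagnostic Synthesis",
]


def stage_accuracy(records):
    """Return (aggregate accuracy, {stage: accuracy}) for (stage, predicted, truth) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stage, predicted, truth in records:
        totals[stage] += 1
        hits[stage] += int(predicted == truth)
    per_stage = {s: hits[s] / totals[s] for s in STAGES if totals[s]}
    aggregate = sum(hits.values()) / sum(totals.values())
    return aggregate, per_stage


# Hypothetical records: (stage, model verdict on a statement, clinician label).
toy_records = (
    [("Anatomical Mapping", True, True)] * 9
    + [("Anatomical Mapping", True, False)] * 1
    + [("Lesion Identification", True, True)] * 9
    + [("Lesion Identification", False, True)] * 1
    + [("Feature Characterization", True, True)] * 4
    + [("Feature Characterization", True, False)] * 6
    + [("Diagnostic Synthesis", True, True)] * 9
    + [("Diagnostic Synthesis", False, True)] * 1
)

aggregate, per_stage = stage_accuracy(toy_records)
print(f"aggregate: {aggregate:.3f}")
for stage, acc in per_stage.items():
    print(f"{stage}: {acc:.2f}")
```

In this toy data the aggregate looks moderate while one stage fails on most of its statements, which is the kind of pattern a step-level report would surface and a single number would not.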
Where Pith is reading between the lines
- Similar staged decomposition could be applied to other medical imaging modalities to expose comparable hidden failure modes.
- Models trained with explicit supervision aligned to each diagnostic stage might reduce downstream hallucination rates.
- Clinical workflows could adopt step-wise verification checkpoints derived from the benchmark to audit AI outputs in practice.
Load-bearing premise
The four expert-designed stages form a complete and unbiased decomposition of clinical reasoning, and the clinician-verified annotations supply reliable ground truth without selection or interpretation biases.
What would settle it
A replication study in which independent clinicians re-label the same image-statement pairs and produce substantially different stage assignments would show that the annotations do not provide stable ground truth.
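One way such a replication could be scored is with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between the original and re-labelled stage assignments. Using kappa is an assumption of this example, not a procedure described by the paper, and the labels `S1` to `S4` and both label lists are hypothetical. What counts as "substantially different" would still need a threshold chosen by the replicating team.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical stage assignments for eight image-statement pairs.
original_labels = ["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S4"]
replication_labels = ["S1", "S2", "S2", "S2", "S3", "S4", "S4", "S4"]

kappa = cohen_kappa(original_labels, replication_labels)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 would indicate stable stage assignments; a low value would support the concern that the annotations do not provide stable ground truth.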
Original abstract
Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Med-StepBench, a benchmark for step-wise hallucination detection in vision-language models on 3D oncological PET/CT scans. It comprises over 12,000 images and more than 1,000,000 image-statement pairs spanning volumetric and multi-view 2D data, decomposes clinical reasoning into four expert-designed diagnostic stages with clinician-verified annotations, evaluates general-purpose and medical VLMs to expose systematic failure modes hidden by aggregate metrics, and demonstrates model susceptibility to adversarial yet clinically plausible intermediate explanations that amplify hallucinations despite contradictory visual evidence.
Significance. If the four-stage decomposition proves representative and the annotations reliable, the benchmark fills a gap in existing medical hallucination evaluations by enabling granular assessment of multi-step reasoning in 3D imaging rather than one-shot 2D diagnosis. This could support development of safer VLMs by identifying grounding failures at specific diagnostic stages and testing robustness to plausible adversarial inputs.
Major comments (3)
- §3.1 (Benchmark Construction): The four expert-designed diagnostic stages are presented as a decomposition of clinical reasoning without a reported coverage analysis against independent radiologist workflows, inter-stage dependency checks, or quantification of how often real diagnostic sequences deviate from the prescribed order. This is load-bearing because the step-level hallucination labels and reported failure modes depend directly on the taxonomy being complete and unbiased.
- §4.2 (Clinician Verification): The manuscript states that annotations are clinician-verified but provides insufficient detail on inter-clinician agreement metrics, the number of reviewers per statement, or the resolution of ambiguous cases across the >1,000,000 pairs. Without these, the reliability of the ground-truth step labels cannot be assessed and the central claim of revealing intrinsic model limitations is weakened.
- §5.3 (Adversarial Evaluation): The claim that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations requires an explicit description of how these explanations were generated, how their clinical plausibility was independently verified, and a quantitative comparison of hallucination rates with and without the adversarial step. The current presentation leaves open whether the amplification effect is an artifact of the stage taxonomy.
Minor comments (2)
- Abstract: The phrase 'first large-scale benchmark' should be qualified with a brief comparison to the scale and scope of prior medical hallucination benchmarks to avoid overstatement.
- Figure 2 and Table 1: Legends and captions should explicitly distinguish results on volumetric data versus multi-view 2D projections to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the work presented and indicating where revisions will strengthen the paper.
- Referee: §3.1 (Benchmark Construction): The four expert-designed diagnostic stages are presented as a decomposition of clinical reasoning without a reported coverage analysis against independent radiologist workflows, inter-stage dependency checks, or quantification of how often real diagnostic sequences deviate from the prescribed order. This is load-bearing because the step-level hallucination labels and reported failure modes depend directly on the taxonomy being complete and unbiased.
  Authors: We appreciate the referee's emphasis on rigorously validating the stage taxonomy. The four stages were developed iteratively with expert oncological radiologists to mirror standard hierarchical reasoning in PET/CT interpretation, informed by established clinical protocols. The original submission did not include a formal coverage analysis or inter-stage dependency quantification. In the revised manuscript, we will expand §3.1 to include a detailed rationale with references to radiological guidelines, along with results from a pilot evaluation using independent radiologist annotations to assess coverage and typical sequence deviations. This addition will better support the taxonomy's representativeness.
  Revision: yes
- Referee: §4.2 (Clinician Verification): The manuscript states that annotations are clinician-verified but provides insufficient detail on inter-clinician agreement metrics, the number of reviewers per statement, or the resolution of ambiguous cases across the >1,000,000 pairs. Without these, the reliability of the ground-truth step labels cannot be assessed and the central claim of revealing intrinsic model limitations is weakened.
  Authors: We agree that additional methodological details are required to substantiate annotation reliability. The revised §4.2 will specify the verification protocol, including the number of clinicians, inter-rater agreement metrics, and procedures for resolving ambiguous cases. Given the dataset scale, we will clarify that verification combined comprehensive review of a stratified sample with targeted checks for the full set, enabling readers to evaluate ground-truth quality and the robustness of our findings on model limitations.
  Revision: yes
- Referee: §5.3 (Adversarial Evaluation): The claim that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations requires an explicit description of how these explanations were generated, how their clinical plausibility was independently verified, and a quantitative comparison of hallucination rates with and without the adversarial step. The current presentation leaves open whether the amplification effect is an artifact of the stage taxonomy.
  Authors: We will revise §5.3 to fully detail the adversarial explanation generation process, including the constrained prompting approach used to produce clinically plausible intermediates. We will also describe the independent clinician verification for plausibility and add quantitative comparisons of hallucination rates with versus without the adversarial steps, supported by statistical tests. These enhancements will demonstrate the amplification effect while addressing potential taxonomy-related artifacts.
  Revision: yes
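The rebuttal promises a with-versus-without comparison "supported by statistical tests" but does not name a test. As one hedged illustration, the sketch below applies an exact McNemar test to paired outcomes, where each statement is evaluated once without and once with an adversarial intermediate explanation. The choice of McNemar's test, the function name `mcnemar_exact`, and the toy outcome lists are assumptions of this example, not the authors' stated method.

```python
from math import comb


def mcnemar_exact(baseline_errors, adversarial_errors):
    """Two-sided exact McNemar test on paired boolean outcomes.

    Each list holds True where the model hallucinated on that item.
    Returns (b, c, p_value), where b counts items that hallucinate only
    with the adversarial step and c counts items that hallucinate only without it.
    """
    if len(baseline_errors) != len(adversarial_errors):
        raise ValueError("paired outcome lists must have equal length")
    pairs = list(zip(baseline_errors, adversarial_errors))
    b = sum(1 for base, adv in pairs if not base and adv)
    c = sum(1 for base, adv in pairs if base and not adv)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical paired outcomes for 20 statements (True = hallucination).
baseline = [True] * 2 + [False] * 8 + [True] * 1 + [False] * 9
adversarial = [True] * 2 + [True] * 8 + [False] * 1 + [False] * 9

b, c, p_value = mcnemar_exact(baseline, adversarial)
baseline_rate = sum(baseline) / len(baseline)
adversarial_rate = sum(adversarial) / len(adversarial)
print(f"rate without: {baseline_rate:.2f}, with: {adversarial_rate:.2f}, p = {p_value:.4f}")
```

A paired test fits here because the same statements are scored under both conditions; running it separately per diagnostic stage would also help check whether any amplification is concentrated in one part of the taxonomy.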
Circularity Check
No circularity: benchmark construction with no derivations or self-referential predictions
The paper introduces Med-StepBench as a new dataset and evaluation framework for step-wise hallucination detection in medical VLMs, decomposing clinical reasoning into four expert-designed stages with clinician-verified annotations. No mathematical equations, fitted parameters, predictions of related quantities, or load-bearing self-citations appear in the provided text or abstract. The four-stage decomposition is presented as an expert design choice rather than a result derived from prior inputs or self-referential theorems. The work is a self-contained empirical contribution (new images, statements, and annotations) in which no claim reduces to its own construction, consistent with the absence of any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
Axioms (2)
- domain assumption: The four expert-designed diagnostic stages accurately and completely capture clinical reasoning for oncological PET/CT interpretation.
- domain assumption: Clinician-verified annotations provide unbiased and reliable ground truth for detecting hallucinations at each step.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: decomposes clinical reasoning into four expert-designed diagnostic stages: Anatomical Mapping, Lesion Identification, Feature Characterization, and Diagnostic Synthesis
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: step-wise hallucination evaluation procedure that mirrors the clinical reasoning pipeline
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.