
arxiv: 2605.10002 · v1 · submitted 2026-05-11 · 💻 cs.CV

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords hallucination detection · vision-language models · medical imaging · PET/CT · step-wise reasoning · benchmark · clinical reasoning · 3D oncological imaging

The pith

Med-StepBench decomposes clinical reasoning into four diagnostic stages to detect hallucinations step by step in medical vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Med-StepBench as the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT imaging. It decomposes the diagnostic process into four expert-designed stages and supplies over 12,000 images with more than one million image-statement pairs that carry clinician-verified labels. This structure tests whether models correctly ground each intermediate claim rather than producing only a plausible final diagnosis. The evaluation shows systematic errors at individual stages and demonstrates that models readily adopt adversarial but clinically plausible intermediate statements even when visual evidence contradicts them. A sympathetic reader would care because aggregate accuracy scores hide these reasoning failures, and undetected errors could affect safety in clinical use of vision-language models.

Core claim

Med-StepBench is the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT. It comprises over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data. It decomposes clinical reasoning into four expert-designed diagnostic stages and uses clinician-verified annotations to perform the first step-level evaluation of general-purpose and medical VLMs. The evaluation reveals systematic failure modes that aggregate accuracy metrics obscure, and it shows that current VLMs remain highly susceptible to adversarial yet clinically plausible intermediate explanations that amplify hallucinations despite contradictory visual cues.

What carries the argument

The four expert-designed diagnostic stages that decompose clinical reasoning into successive steps for localization, abnormality identification, and related inferences in PET/CT data.

If this is right

  • Aggregate accuracy metrics conceal critical reasoning errors that only appear at individual diagnostic stages.
  • Vision-language models become significantly more prone to hallucinations when supplied with plausible but incorrect intermediate explanations.
  • Step-level evaluation is required to identify where models fail to ground multi-step clinical reasoning in visual evidence.
  • The benchmark supplies a concrete testbed for training and evaluating safer medical vision-language models that handle intermediate steps reliably.
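
To make the first and third points concrete, here is a minimal Python sketch of how per-stage accuracy can diverge from an aggregate score. The stage names (`stage_1` to `stage_4`), the record format, and the toy counts are illustrative assumptions, not data or code from the paper.

```python
from collections import defaultdict


def stage_accuracy(records):
    """Compute accuracy per stage and overall.

    records: iterable of (stage, is_correct) tuples, where is_correct says
    whether the model judged the statement correctly. This record format is
    a hypothetical choice made for the sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for stage, is_correct in records:
        totals[stage] += 1
        hits[stage] += int(is_correct)
    per_stage = {stage: hits[stage] / totals[stage] for stage in totals}
    aggregate = sum(hits.values()) / sum(totals.values())
    return per_stage, aggregate


# Toy data (invented): three stages at 9/10 correct, one stage at 4/10.
records = (
    [("stage_1", True)] * 9 + [("stage_1", False)] * 1
    + [("stage_2", True)] * 9 + [("stage_2", False)] * 1
    + [("stage_3", True)] * 9 + [("stage_3", False)] * 1
    + [("stage_4", True)] * 4 + [("stage_4", False)] * 6
)

per_stage, aggregate = stage_accuracy(records)
print(per_stage)
print(aggregate)
```

In this toy data the aggregate is 0.775 while `stage_4` sits at 0.4, which is the kind of gap a single overall number would hide.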

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar staged decomposition could be applied to other medical imaging modalities to expose comparable hidden failure modes.
  • Models trained with explicit supervision aligned to each diagnostic stage might reduce downstream hallucination rates.
  • Clinical workflows could adopt step-wise verification checkpoints derived from the benchmark to audit AI outputs in practice.

Load-bearing premise

The four expert-designed stages form a complete and unbiased decomposition of clinical reasoning, and the clinician-verified annotations supply reliable ground truth free of selection or interpretation bias.

What would settle it

A replication study in which independent clinicians re-label the same image-statement pairs and produce substantially different stage assignments would show that the annotations do not provide stable ground truth.
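
One common way to quantify such a replication is chance-corrected agreement between the original and the independent labels. The sketch below computes Cohen's kappa for two annotators. The choice of kappa, the label strings, and the toy data are assumptions for illustration, since this summary does not say which agreement metric the authors use.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (invented): the annotators disagree on 2 of 8 statements.
original = ["supported"] * 4 + ["hallucinated"] * 4
relabel = ["supported"] * 3 + ["hallucinated"] * 4 + ["supported"]

kappa = cohens_kappa(original, relabel)
print(kappa)
```

Here observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5. Where to draw the line for "substantially different" is a judgment the replication study would have to justify.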

Figures

Figures reproduced from arXiv: 2605.10002 by Amir Reza Jafari, Dai Lam Le, Mai Hong Son, Mai Huy Thong, Minh Khoi Nguyen, Noel Crespi, Phi Le Nguyen, Quang Huy Nguyen, Reza Farahbakhsh, Thanh Trung Nguyen, Tuan Dung Nguyen.

Figure 1: Qualitative examples from Med-StepBench illustrating …
Figure 2: PET/CT overlay method. Oncologists routinely examine …
Figure 3: Step-wise hallucination evaluation framework. Decomposing each PET/CT case into four sequential reasoning stages and extracting clinician-verified step-level ground-truth statements. For each step, we generate paired hallucinated statements by introducing anatomically plausible but incorrect edits (mismatch, false abnormality, fabricated attributes, false disease claim), enabling fine-grained evaluation of…
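
The pairing described in the Figure 3 caption can be pictured as a small record type. The following is a sketch under stated assumptions: the four edit types come from the caption, but the field names, the `case_id` format, the example sentences, and the pairing of an edit type with a particular stage are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class EditType(Enum):
    # The four edit types named in the Figure 3 caption.
    MISMATCH = "mismatch"
    FALSE_ABNORMALITY = "false abnormality"
    FABRICATED_ATTRIBUTES = "fabricated attributes"
    FALSE_DISEASE_CLAIM = "false disease claim"


@dataclass(frozen=True)
class StatementPair:
    case_id: str          # hypothetical identifier for a PET/CT case
    stage: int            # reasoning stage, 1 to 4
    edit_type: EditType   # how the hallucinated variant was derived
    truthful: str         # clinician-verified statement
    hallucinated: str     # plausible but incorrect counterpart

    def __post_init__(self):
        if not 1 <= self.stage <= 4:
            raise ValueError("stage must be between 1 and 4")

    def as_labeled_items(self):
        """Return two (case_id, stage, statement, is_supported) tuples."""
        return [
            (self.case_id, self.stage, self.truthful, True),
            (self.case_id, self.stage, self.hallucinated, False),
        ]


# Invented example; which edit type belongs to which stage is not stated
# in this summary, so the combination below is arbitrary.
pair = StatementPair(
    case_id="case-0001",
    stage=1,
    edit_type=EditType.MISMATCH,
    truthful="Focal uptake is located in the right upper lung lobe.",
    hallucinated="Focal uptake is located in the left lower lung lobe.",
)
items = pair.as_labeled_items()
print(items)
```

Each pair yields one supported and one unsupported item for the same image and stage, so a model that always answers "supported" scores exactly 50% on the pair.
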
Original abstract

Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Med-StepBench, a benchmark for step-wise hallucination detection in vision-language models on 3D oncological PET/CT scans. It comprises over 12,000 images and more than 1,000,000 image-statement pairs spanning volumetric and multi-view 2D data, decomposes clinical reasoning into four expert-designed diagnostic stages with clinician-verified annotations, evaluates general-purpose and medical VLMs to expose systematic failure modes hidden by aggregate metrics, and demonstrates model susceptibility to adversarial yet clinically plausible intermediate explanations that amplify hallucinations despite contradictory visual evidence.

Significance. If the four-stage decomposition proves representative and the annotations reliable, the benchmark fills a gap in existing medical hallucination evaluations by enabling granular assessment of multi-step reasoning in 3D imaging rather than one-shot 2D diagnosis. This could support development of safer VLMs by identifying grounding failures at specific diagnostic stages and testing robustness to plausible adversarial inputs.

major comments (3)
  1. [§3.1] §3.1 (Benchmark Construction): The four expert-designed diagnostic stages are presented as a decomposition of clinical reasoning without reported coverage analysis against independent radiologist workflows, inter-stage dependency checks, or quantification of how often real diagnostic sequences deviate from the prescribed order; this is load-bearing because the step-level hallucination labels and reported failure modes depend directly on the taxonomy being complete and unbiased.
  2. [§4.2] §4.2 (Clinician Verification): The manuscript states that annotations are clinician-verified but provides insufficient detail on inter-clinician agreement metrics, number of reviewers per statement, or resolution of ambiguous cases across the >1,000,000 pairs; without these, the reliability of the ground-truth step labels cannot be assessed and the central claim of revealing intrinsic model limitations is weakened.
  3. [§5.3] §5.3 (Adversarial Evaluation): The claim that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations requires explicit description of how these explanations were generated, how their clinical plausibility was independently verified, and quantitative comparison of hallucination rates with and without the adversarial step; the current presentation leaves open whether the amplification effect is an artifact of the stage taxonomy.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'first large-scale benchmark' should be qualified with a brief comparison to the scale and scope of prior medical hallucination benchmarks to avoid overstatement.
  2. [Figure 2] Figure 2 and Table 1: Legends and captions should explicitly distinguish results on volumetric data versus multi-view 2D projections to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the work presented and indicating where revisions will strengthen the paper.

point-by-point responses
  1. Referee: [§3.1] §3.1 (Benchmark Construction): The four expert-designed diagnostic stages are presented as a decomposition of clinical reasoning without reported coverage analysis against independent radiologist workflows, inter-stage dependency checks, or quantification of how often real diagnostic sequences deviate from the prescribed order; this is load-bearing because the step-level hallucination labels and reported failure modes depend directly on the taxonomy being complete and unbiased.

    Authors: We appreciate the referee's emphasis on rigorously validating the stage taxonomy. The four stages were developed iteratively with expert oncological radiologists to mirror standard hierarchical reasoning in PET/CT interpretation, informed by established clinical protocols. The original submission did not include a formal coverage analysis or inter-stage dependency quantification. In the revised manuscript, we will expand §3.1 to include a detailed rationale with references to radiological guidelines, along with results from a pilot evaluation using independent radiologist annotations to assess coverage and typical sequence deviations. This addition will better support the taxonomy's representativeness. revision: yes

  2. Referee: [§4.2] §4.2 (Clinician Verification): The manuscript states that annotations are clinician-verified but provides insufficient detail on inter-clinician agreement metrics, number of reviewers per statement, or resolution of ambiguous cases across the >1,000,000 pairs; without these, the reliability of the ground-truth step labels cannot be assessed and the central claim of revealing intrinsic model limitations is weakened.

    Authors: We agree that additional methodological details are required to substantiate annotation reliability. The revised §4.2 will specify the verification protocol, including the number of clinicians, inter-rater agreement metrics, and procedures for resolving ambiguous cases. Given the dataset scale, we will clarify that verification combined comprehensive review of a stratified sample with targeted checks for the full set, enabling readers to evaluate ground-truth quality and the robustness of our findings on model limitations. revision: yes

  3. Referee: [§5.3] §5.3 (Adversarial Evaluation): The claim that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations requires explicit description of how these explanations were generated, how their clinical plausibility was independently verified, and quantitative comparison of hallucination rates with and without the adversarial step; the current presentation leaves open whether the amplification effect is an artifact of the stage taxonomy.

    Authors: We will revise §5.3 to fully detail the adversarial explanation generation process, including the constrained prompting approach used to produce clinically plausible intermediates. We will also describe the independent clinician verification for plausibility and add quantitative comparisons of hallucination rates with versus without the adversarial steps, supported by statistical tests. These enhancements will demonstrate the amplification effect while addressing potential taxonomy-related artifacts. revision: yes
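
The with-versus-without comparison promised in the third response can be sketched as a paired analysis. This example assumes each statement is judged twice by the same model, once without and once with the adversarial intermediate explanation, and applies an exact McNemar test to the discordant pairs. McNemar's test is a choice made for this sketch. The rebuttal says only "statistical tests", and the toy outcomes are invented.

```python
from math import comb


def hallucination_rates(baseline, adversarial):
    """Each list holds booleans per item: True means the model hallucinated."""
    n = len(baseline)
    return sum(baseline) / n, sum(adversarial) / n


def mcnemar_exact(baseline, adversarial):
    """Two-sided exact McNemar p-value on paired boolean outcomes."""
    # Items that were fine at baseline but hallucinated under attack.
    to_halluc = sum((not b) and a for b, a in zip(baseline, adversarial))
    # Items that hallucinated at baseline but were fine under attack.
    to_correct = sum(b and (not a) for b, a in zip(baseline, adversarial))
    n = to_halluc + to_correct
    if n == 0:
        return 1.0
    k = min(to_halluc, to_correct)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy outcomes (invented): 10 items, 6 flip to hallucination under attack.
baseline = [False] * 8 + [True] * 2
adversarial = [True] * 6 + [False] * 2 + [True] * 2

rates = hallucination_rates(baseline, adversarial)
p_value = mcnemar_exact(baseline, adversarial)
print(rates, p_value)
```

With these toy outcomes the hallucination rate rises from 0.2 to 0.8 and the exact p-value is 0.03125. A paired test fits this setup because the same items appear in both conditions, so only the items whose outcome changes carry information about the effect.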

Circularity Check

0 steps flagged

No circularity: benchmark construction with no derivations or self-referential predictions

full rationale

The paper introduces Med-StepBench as a new dataset and evaluation framework for step-wise hallucination detection in medical VLMs, decomposing clinical reasoning into four expert-designed stages with clinician-verified annotations. No mathematical equations, fitted parameters, predictions of related quantities, or load-bearing self-citations appear in the provided text or abstract. The four-stage decomposition is presented as an expert design choice rather than a result derived from prior inputs or self-referential theorems. The work is a self-contained empirical contribution (new images, statements, and annotations) with no reduction of any claim to its own construction, consistent with the absence of any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper's contribution rests on the validity of the four-stage decomposition and the quality of clinician annotations as ground truth; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption The four expert-designed diagnostic stages accurately and completely capture clinical reasoning for oncological PET/CT interpretation.
    The entire benchmark and evaluation framework is built around this decomposition.
  • domain assumption Clinician-verified annotations provide unbiased and reliable ground truth for detecting hallucinations at each step.
    All reported evaluations and failure mode analyses depend on these annotations.

pith-pipeline@v0.9.0 · 5549 in / 1430 out tokens · 88031 ms · 2026-05-12T03:49:18.247369+00:00 · methodology


