Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Dingkang Yang; Dongling Xiao; Jiawei Chen; Ke Li; Lihua Zhang; Mingcheng Li; Shunli Wang; Tong Wu; Xiaolu Hou; Yue Jiang

arXiv: 2406.10185 · v2 · submitted 2024-06-14 · 💻 cs.CV

Pith reviewed 2026-05-23 23:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical hallucinations · large vision language models · hallucination benchmark · hallucination detection · medical imaging · AI safety · multimodal evaluation

The pith

Med-HallMark benchmark and MediHall Score enable detection and granular evaluation of hallucinations in medical vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first benchmark dedicated to hallucinations in medical multimodal AI, filling a gap where existing tools do not address domain-specific risks. It supplies Med-HallMark with multi-task support, varied hallucination examples, and a hierarchy of error types. The work pairs this with MediHall Score, a metric that assigns values based on hallucination severity and category to reflect potential clinical harm more closely than standard accuracy measures. Experiments then set baselines for current models and introduce a detector trained specifically for the task.

Core claim

Med-HallMark supplies the first medical multimodal benchmark with multi-task hallucination support, multifaceted data, and hierarchical categorization; MediHall Score applies a hierarchical scoring system that weighs severity and type to assess clinical impacts; and MediHallDetector, trained via multitask learning on the benchmark, achieves improved detection performance over prior approaches.

What carries the argument

Med-HallMark benchmark together with the MediHall Score hierarchical severity-and-type metric that produces a single numeric assessment of hallucination risk.
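To make the idea of a severity-and-type metric concrete, here is a minimal sketch of a severity-weighted score. The level names and credit values in `LEVEL_CREDIT` are illustrative assumptions, not the categories or weights defined in the paper; only the general shape (a per-answer level, mapped to a credit, then averaged) follows the description above.

```python
# Hypothetical per-level credit. These names and values are illustrative
# assumptions, NOT the weights defined by MediHall Score.
LEVEL_CREDIT = {
    "catastrophic": 0.0,
    "critical": 0.25,
    "minor": 0.75,
    "correct": 1.0,
}


def severity_weighted_score(levels):
    """Mean credit over a list of per-answer hallucination levels."""
    if not levels:
        raise ValueError("need at least one labeled answer")
    return sum(LEVEL_CREDIT[level] for level in levels) / len(levels)


def accuracy(levels):
    """Fraction of answers labeled fully correct (the binary view)."""
    if not levels:
        raise ValueError("need at least one labeled answer")
    return sum(level == "correct" for level in levels) / len(levels)


# Two hypothetical models: same accuracy, different error severity.
model_a = ["correct", "correct", "minor", "minor"]
model_b = ["correct", "correct", "catastrophic", "catastrophic"]

print(accuracy(model_a), severity_weighted_score(model_a))
print(accuracy(model_b), severity_weighted_score(model_b))
```

With these assumed weights the two hypothetical models tie on accuracy (0.5 each) but separate on the weighted score (0.875 versus 0.5), which is the kind of distinction a graded metric is meant to surface.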

If this is right

Existing large vision-language models receive explicit baseline hallucination rates on medical tasks.
MediHallDetector demonstrates higher detection accuracy than general-purpose methods when tested on the new benchmark.
The scoring system distinguishes hallucination impacts at a finer level than aggregate accuracy or F1 metrics alone.
Resources released allow other groups to train and compare models on the same medical hallucination tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption could shift medical AI evaluation from binary correctness checks toward graded risk profiles before clinical use.
The multitask training pattern in MediHallDetector may transfer to hallucination detection in non-medical high-stakes vision-language settings.
If the hierarchy proves stable, future work could extend the same structure to generate training signals that penalize high-severity errors more heavily.

Load-bearing premise

The hierarchical categories and severity levels used in MediHall Score match the actual clinical consequences that would arise from those errors in real medical settings.

What would settle it

A controlled deployment study in which model outputs scored by MediHall Score show no correlation with independent expert ratings of clinical harm on the same cases.
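One way such a study could be analyzed is with a rank correlation between the metric and the expert harm ratings. The sketch below computes Spearman's rho in plain Python; the function names and the example numbers are hypothetical and stand in for per-case scores and clinician ratings that the paper does not report.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: metric score per case and expert harm rating (1-5).
metric_scores = [0.9, 0.7, 0.4, 0.2, 0.0]
expert_harm = [1, 2, 3, 4, 5]
print(spearman(metric_scores, expert_harm))
```

In this made-up example higher scores line up perfectly with lower harm, so rho is at its negative extreme; a value near zero on real data would be the disconfirming outcome described above.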

Figures

Figures reproduced from arXiv: 2406.10185 by Dingkang Yang, Dongling Xiao, Jiawei Chen, Ke Li, Lihua Zhang, Mingcheng Li, Shunli Wang, Tong Wu, Xiaolu Hou, Yue Jiang.

**Figure 2.** Visualization of MediHallDetector related information. (a) Model structure, SFT process …

**Figure 3.** Comparison of different models on hallucination types. Analysis of six-dimensional hallucination level …

**Figure 4.** Prefix of confidence-weakening questions.

**Figure 5.** Prompts for GPT-4 to create counterfactual questions.

**Figure 6.** Instructions for MediHallDetector for SFT and inference on the VQA and IRG tasks.

**Figure 7.** Instructions for the baseline model for inference on the IRG task.

**Figure 8.** Prompts for GPT-3.5 to expand original questions.

**Figure 9.** Examples of output text at different hallucination levels (Catastrophic Hallucination, Critical …

**Figure 10.** Responses from different models on conventional questions.

**Figure 11.** Responses from different models on confidence-weakening questions.

**Figure 12.** Responses from different models on counterfactual questions.

**Figure 13.** Responses from different models on image depiction questions (IRG).

**Figure 14.** Visualization of MediHallDetector and other powerful LVLMs on multimodal hallucination …

Original abstract

Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, currently, there are no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work have been released at https://github.com/ydk122024/Med-HallMark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Med-HallMark supplies the first dedicated benchmark for medical LVLM hallucinations and releases the data, but the MediHall Score's severity hierarchy rests on unvalidated modeling choices rather than clinical evidence.

The paper's main contribution is Med-HallMark, a benchmark that targets hallucination detection in medical vision-language models across multiple tasks with hierarchical categories. It also ships MediHallDetector, a multitask model trained for this purpose, and reports baselines on existing LVLMs. The GitHub release of the resources is a practical plus for anyone who wants to run the evaluations themselves. That addresses a clear safety gap as these models move into report generation and visual QA in healthcare.

The abstract and stress-test note both indicate the authors define the categories and scoring rules internally, which is a reasonable starting point for a new benchmark. The work shows clear engagement with the general hallucination literature and extends it to the medical multimodal setting without obvious circularity or invented entities.

The central limitation is that MediHall Score claims to give a more granular view of clinical impact through its type-and-severity hierarchy, yet the description provides no inter-rater checks with clinicians, no correlation to downstream error rates, and no mapping to existing medical error taxonomies. That makes the improvement over standard metrics an assumption rather than a demonstrated result. Minor issues include the lack of statistical detail on the reported gains in the abstract, though the full text may contain more.

This paper is aimed at groups already working on medical multimodal models or on hallucination mitigation. It is coherent on its own terms and deserves a serious referee because the problem is timely and the artifact is new, even if the scoring system will need external validation in revision.

Referee Report

3 major / 1 minor

Summary. The paper claims to introduce Med-HallMark as the first benchmark for hallucination detection and evaluation in medical LVLMs, supporting multi-tasking, multifaceted data, and hierarchical categorization. It proposes the MediHall Score as a hierarchical metric that accounts for hallucination type and severity to enable granular assessment of clinical impacts, along with MediHallDetector, a multitask-trained LVLM for detection. Experiments establish baselines for existing LVLMs and report that the new score is more nuanced than traditional metrics while the detector shows enhanced performance; all resources are released publicly.

Significance. If the benchmark construction and MediHall Score hierarchy can be externally validated, the work would fill a clear gap by providing domain-specific tools for assessing LVLMs in high-stakes medical imaging and VQA tasks. The public release of the benchmark, code, and data at the cited GitHub repository is a concrete strength that supports reproducibility.

major comments (3)

[MediHall Score subsection] The definition and scoring rules for the MediHall Score (hierarchical severity levels and type categorization) are presented without any reported anchoring to clinical reality, such as inter-rater agreement statistics with practicing clinicians, correlation against downstream clinical error rates, or mapping to existing medical error taxonomies. This directly affects the central claim that the score enables a more nuanced understanding of clinical impacts than traditional metrics.
[Benchmark construction section] The description of Med-HallMark benchmark construction provides no details on data sourcing, generation of the multifaceted hallucination examples, or any validation procedure (e.g., expert review or coverage analysis) for the hierarchical categories. Without this, it is impossible to evaluate whether the benchmark actually covers the error types that occur in medical practice, which is load-bearing for the claim that it is the first dedicated medical multimodal hallucination benchmark.
[Experimental evaluations] The experimental section reports performance gains for MediHallDetector and advantages of the MediHall Score but does not include statistical significance tests, confidence intervals, or ablation controls on the multitask training. This weakens the ability to substantiate the baseline comparisons and the detector's claimed superiority.
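The confidence intervals requested in the third comment could be obtained, for example, with a paired bootstrap over test items. The following is a minimal sketch under stated assumptions: `correct_a` and `correct_b` are hypothetical 0/1 vectors recording whether two detectors were right on the same items, and the percentile interval is one common choice among several.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) on shared test items."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty vectors of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo_pos = max(0, int((alpha / 2) * n_resamples))
    hi_pos = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return diffs[lo_pos], diffs[hi_pos]


# Hypothetical per-item correctness for two detectors on ten items.
detector_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
detector_b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
low, high = paired_bootstrap_ci(detector_a, detector_b)
print(f"95% CI for accuracy difference: [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero would support a claimed gain; on a sample this small the interval will be wide, which is itself useful to report.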

minor comments (1)

[MediHall Score subsection] Notation for the MediHall Score components could be introduced more formally with an equation or pseudocode to improve clarity when describing the hierarchical computation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [MediHall Score subsection] The definition and scoring rules for the MediHall Score (hierarchical severity levels and type categorization) are presented without any reported anchoring to clinical reality, such as inter-rater agreement statistics with practicing clinicians, correlation against downstream clinical error rates, or mapping to existing medical error taxonomies. This directly affects the central claim that the score enables a more nuanced understanding of clinical impacts than traditional metrics.

Authors: We agree that empirical anchoring to clinical practice would strengthen the MediHall Score. The initial manuscript does not report inter-rater agreement or direct clinical correlations. In revision we will add a validation subsection with clinician ratings and inter-rater statistics to support the claim of nuanced clinical assessment. revision: yes

Referee: [Benchmark construction section] The description of Med-HallMark benchmark construction provides no details on data sourcing, generation of the multifaceted hallucination examples, or any validation procedure (e.g., expert review or coverage analysis) for the hierarchical categories. Without this, it is impossible to evaluate whether the benchmark actually covers the error types that occur in medical practice, which is load-bearing for the claim that it is the first dedicated medical multimodal hallucination benchmark.

Authors: We agree that the construction details are insufficient. We will expand the Benchmark construction section with explicit information on data sources, hallucination example generation, and any validation steps performed. revision: yes

Referee: [Experimental evaluations] The experimental section reports performance gains for MediHallDetector and advantages of the MediHall Score but does not include statistical significance tests, confidence intervals, or ablation controls on the multitask training. This weakens the ability to substantiate the baseline comparisons and the detector's claimed superiority.

Authors: We agree that statistical rigor is needed. We will add significance tests, confidence intervals, and multitask ablation controls to the experimental section in revision. revision: yes
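Inter-rater statistics of the kind mentioned in the first response are commonly reported as Cohen's kappa for two raters. A minimal sketch follows; the label names and the two rating lists are hypothetical, and nothing here reflects an analysis the authors ran.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label only")
    return (observed - expected) / (1 - expected)


# Hypothetical severity labels from a clinician and from the benchmark.
clinician = ["critical", "minor", "minor", "critical", "minor", "critical"]
benchmark = ["critical", "minor", "critical", "critical", "minor", "minor"]
print(round(cohens_kappa(clinician, benchmark), 3))
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, so it gives a more conservative picture than raw percent agreement when a few severity levels dominate.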

Circularity Check

0 steps flagged

No circularity: benchmark and metric introduced by explicit construction

The paper introduces Med-HallMark benchmark and MediHall Score via new hierarchical categorization and scoring rules defined within the work itself. No equations, fitted parameters called predictions, or load-bearing self-citations appear in the provided text. Claims rest on the artifacts' construction and experimental baselines rather than any reduction to prior inputs or self-referential definitions. This is a standard case of artifact introduction with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view shows no explicit free parameters, axioms, or invented entities; the work rests on the domain assumption that hallucinations can be meaningfully categorized by type and clinical severity, but no specific fitted values or new postulated entities are mentioned.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation
cs.CV · 2026-05 · conditional · novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that...

Evaluating the Search Agent in a Parallel World
cs.AI · 2026-03 · unverdicted · novelty 7.0

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping ...

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
cs.CV · 2026-05 · unverdicted · novelty 6.0

Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.

VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering
cs.CV · 2026-05 · unverdicted · novelty 6.0

VIHD detects hallucinations in medical MLLMs by identifying visually dominant decoder layers via probing and applying visual token masking to calibrate semantic entropy as a detection signal.

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
cs.CV · 2026-05 · unverdicted · novelty 6.0

MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
cs.CV · 2026-05 · unverdicted · novelty 6.0

MedVIGIL provides a 300-case evaluation suite with 2556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.

Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
cs.CV · 2026-04 · unverdicted · novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
cs.CV · 2026-04 · unverdicted · novelty 5.0

MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 7 Pith papers · 11 internal anchors

[1]

GPT-4V(vision) system card

Openai, 2023. GPT-4V(vision) system card. 1

work page 2023
[2]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Miss: A generative pretraining and finetuning approach for med-vqa

Jiawei Chen, Dingkang Yang, Yue Jiang, et al. Miss: A generative pretraining and finetuning approach for med-vqa. arXiv preprint arXiv:2401.05163, 2024. 3, 4

work page arXiv 2024
[4]

Efficiency in focus: Layernorm as a catalyst for fine-tuning medical visual language pre-trained models

Jiawei Chen, Dingkang Yang, Yue Jiang, Mingcheng Li, Jinjie Wei, Xiaolu Hou, and Lihua Zhang. Efficiency in focus: Layernorm as a catalyst for fine-tuning medical visual language pre-trained models. arXiv preprint arXiv:2404.16385, 2024. 3

work page arXiv 2024
[5]

Unified hallucination detection for multimodal large language models

Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Jinjie Gu, and Huajun Chen. Unified hallucination detection for multimodal large language models. arXiv preprint arXiv:2402.03190, 2024. 2

work page arXiv 2024
[6]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

W Dai, J Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arxiv 2023. arXiv preprint arXiv:2305.06500. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 38, pages 18135–18143, 2024. 2

work page 2024
[8]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. 2

work page 2023
[9]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317,

work page
[10]

A dataset of clinically generated visual questions and answers about radiology images

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018. 2, 4

work page 2018
[11]

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chunyuan Li, Cliff Wong, Zhang, et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 2, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021. 2, 4

work page 2021
[15]

Mitigat- ing hallucination in large multi-modal models via robust instruction tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigat- ing hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, 2023. 2, 3

work page 2023
[16]

Llava-next: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 6

work page 2024
[17]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Wu, et al. Visual instruction tuning.arXiv preprint arXiv:2304.08485,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Radiology objects in context (roco): a multimodal image dataset

Obioma Pelka, Sven Koitka, Rückert, et al. Radiology objects in context (roco): a multimodal image dataset. In LABELS 2018, MICCAI 2018, pages 180–189, 2018. 2

work page 2018
[19]

Object Hallucination in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Xraygpt: Chest radiographs summarization using medical vision-language models

Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971, 2023. 2, 7

work page arXiv 2023
[21]

Evaluation and analysis of hallucination in large vision-language models

Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023. 3

work page arXiv 2023
[22]

Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly- supervised classification and localization of common thorax diseases

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly- supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017. 2, 4

work page 2097
[23]

Mitigating hallucinations in large vision-language models with instruction contrastive decoding

Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715,

work page arXiv
[24]

Towards generalist foundation model for radiology, 2023

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology, 2023. 7

work page 2023
[25]

Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Hao Jiang, Fei Wu, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv preprint arXiv:2404.14233, 2024. 2

work page arXiv 2024
[26]

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En- hancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 7 12 Confidence-weakening Prefix1. As a developing AI with limited understanding of complex medical contexts, please provide your best interpretation to the qu...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

GPT-4V(vision) system card

Openai, 2023. GPT-4V(vision) system card. 1

work page 2023

[2] [2]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Miss: A generative pretraining and finetuning approach for med-vqa

Jiawei Chen, Dingkang Yang, Yue Jiang, et al. Miss: A generative pretraining and finetuning approach for med-vqa. arXiv preprint arXiv:2401.05163, 2024. 3, 4

work page arXiv 2024

[4] [4]

Efficiency in focus: Layernorm as a catalyst for fine-tuning medical visual language pre-trained models

Jiawei Chen, Dingkang Yang, Yue Jiang, Mingcheng Li, Jinjie Wei, Xiaolu Hou, and Lihua Zhang. Efficiency in focus: Layernorm as a catalyst for fine-tuning medical visual language pre-trained models. arXiv preprint arXiv:2404.16385, 2024. 3

work page arXiv 2024

[5] [5]

Unified hallucination detection for multimodal large language models

Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Jinjie Gu, and Huajun Chen. Unified hallucination detection for multimodal large language models. arXiv preprint arXiv:2402.03190, 2024. 2

work page arXiv 2024

[6] [6]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

W Dai, J Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arxiv 2023. arXiv preprint arXiv:2305.06500. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 38, pages 18135–18143, 2024. 2

work page 2024

[8] [8]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. 2

work page 2023

[9] [9]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317,

work page

[10] [10]

A dataset of clinically generated visual questions and answers about radiology images

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018. 2, 4

work page 2018

[11] [11]

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chunyuan Li, Cliff Wong, Zhang, et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 2, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.

[13] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.

[14] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021.

[15] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, 2023.

[16] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.

[17] Haotian Liu, Chunyuan Li, Wu, et al. Visual instruction tuning. arXiv preprint arXiv:2304.08485.

[18] Obioma Pelka, Sven Koitka, Rückert, et al. Radiology objects in context (ROCO): a multimodal image dataset. In LABELS 2018, MICCAI 2018, pages 180–189, 2018.

[19] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.

[20] Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. XrayGPT: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971, 2023.

[21] Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023.

[22] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2106, 2017.

[23] Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715.

[24] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology, 2023.

[25] Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Hao Jiang, Fei Wu, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback. arXiv preprint arXiv:2404.14233, 2024.

[26] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257, 2023.

[27] Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.

[28] Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023.

[29] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Confidence-weakening Prefix 1. As a developing AI with limited understanding of complex medical contexts, please provide your best interpretation to the qu...