Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

Aizan Zafar; Amgad Muneer; Anas Zafar; Rizwan Qureshi; Sagar Chhabriya; Shaina Raza; Sheeraz Arif; Sumra Khan

arxiv: 2604.08815 · v1 · submitted 2026-04-09 · 💻 cs.CV

Pith reviewed 2026-05-10 16:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal reasoning · vision-language models · medical imaging · hallucination mitigation · context alignment · responsible AI · chest X-ray analysis · explainable AI

The pith

Enforcing agreement across medical evidence sources reduces hallucinations in vision-language model diagnoses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for medicine can produce fluent but poorly grounded answers by favoring one type of input over others. The paper introduces a framework that forces the model to verify consistency among different pieces of clinical information before reaching a conclusion. It does this by adding fixed signals from image measurements, model attention maps, and word-based clues to an unchanged base model. When these signals are checked against each other, the outputs become more accurate, contain fewer invented details, and use fewer words while keeping uncertainty estimates reliable. This matters because trustworthy reasoning is essential for using AI safely in clinical decisions.

Core claim

The proposed context-aligned reasoning framework augments a frozen vision-language model with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of free-form responses, the model produces structured outputs that include supporting evidence, uncertainty estimates, limitations, and safety notes. Performance gains occur only when these signals are integrated via contextual verification, leading to higher AUC scores, fewer hallucinated keywords, and more concise explanations on chest X-ray datasets.
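
As a rough illustration of what such a structured output could look like, the sketch below defines a small Python container with the four components named above. The class name `StructuredFinding`, its field names, and the validation rules are assumptions made for illustration; the paper's actual schema is not given in this summary.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class StructuredFinding:
    """Hypothetical container for one structured diagnostic output."""
    finding: str
    supporting_evidence: list[str]
    uncertainty: float  # assumed to be a probability-like value in [0, 1]
    limitations: list[str] = field(default_factory=list)
    safety_notes: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject outputs that lack evidence or carry an out-of-range uncertainty.
        if not 0.0 <= self.uncertainty <= 1.0:
            raise ValueError("uncertainty must lie in [0, 1]")
        if not self.supporting_evidence:
            raise ValueError("at least one piece of supporting evidence is required")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; none of these strings come from the paper.
example = StructuredFinding(
    finding="placeholder finding",
    supporting_evidence=["placeholder radiomic cue", "placeholder saliency region"],
    uncertainty=0.3,
    limitations=["placeholder limitation"],
    safety_notes=["placeholder safety note"],
)
print(example.to_json())
```

Making the evidence list mandatory is one way to keep a conclusion from being emitted without any stated support, which is the behavior the framework is said to discourage.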

What carries the argument

Contextual verification mechanism that integrates radiomic, explainability, and semantic signals to enforce multi-evidence agreement prior to diagnostic output.

Load-bearing premise

The auxiliary signals from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues are accurate enough and sufficiently independent to support reliable contextual verification.

What would settle it

Observing no reduction in hallucinations or no AUC improvement on a test set where the auxiliary signals are deliberately made inaccurate or contradictory.

Figures

Figures reproduced from arXiv: 2604.08815 by Aizan Zafar, Amgad Muneer, Anas Zafar, Rizwan Qureshi, Sagar Chhabriya, Shaina Raza, Sheeraz Arif, Sumra Khan.

**Figure 1.** Context-aligned reasoning for responsible medical VLMs. Conventional medical vision–language models often rely on a dominant modality, which can produce confident but weakly grounded conclusions. Our framework augments the VLM with heterogeneous clinical evidence sources, including radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. These signals are unified through a c…

**Figure 2.** AUC comparison across modality combinations for …

Original abstract

Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.
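
To put the abstract's before-and-after figures on a common scale, the short sketch below computes absolute and relative changes for each reported pair. The numbers are copied from the abstract; the helper name `relative_change` and the dictionary layout are illustrative choices, not anything the paper defines.

```python
def relative_change(before: float, after: float) -> float:
    """Signed change as a fraction of the starting value."""
    return (after - before) / before


# (before, after) pairs as reported in the abstract.
reported = {
    "AUC": (0.918, 0.925),
    "hallucinated keywords": (1.14, 0.25),
    "explanation length (words)": (19.4, 15.3),
    "model confidence": (0.70, 0.68),
}

for name, (before, after) in reported.items():
    delta = after - before
    print(f"{name}: {delta:+.3f} absolute, {relative_change(before, after):+.1%} relative")
```

On these figures the hallucinated-keyword count falls by roughly 78% and explanation length by roughly 21%, while the AUC gain is under 1% in relative terms, so the reported effect is concentrated in grounding and brevity rather than in discrimination.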

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a workable way to ground medical VLMs with auxiliary signals and structured outputs, but the reported gains rest on untested assumptions about signal quality and lack the controls needed to evaluate them.

The core idea is to freeze a vision-language model and feed it three extra signals—radiomic statistics, explainability activations, and vocabulary-grounded cues—then require them to agree before the model produces a structured diagnosis that includes evidence, uncertainty, limitations, and safety notes. The abstract reports that this context alignment cuts hallucinated keywords from 1.14 to 0.25, shortens explanations from 19.4 to 15.3 words, and nudges AUC from 0.918 to 0.925 on chest X-ray data while keeping uncertainty calibration steady. It also notes that the individual signals give little benefit on their own and that gains appear only after integration. That last point is the most useful observation for anyone working on multimodal medical models.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a context-aligned reasoning framework for medical vision-language models that augments a frozen VLM with auxiliary signals from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. These signals are integrated via contextual verification to enforce multi-evidence agreement, producing structured outputs that include supporting evidence, uncertainty estimates, limitations, and safety notes rather than free-form responses. Experiments on chest X-ray datasets report that context alignment improves AUC from 0.918 to 0.925, reduces hallucinated keywords from 1.14 to 0.25, yields more concise explanations (19.4 to 15.3 words), and maintains calibrated uncertainty (confidence 0.70 to 0.68), with additional cross-dataset insights on CheXpert regarding modality informativeness.

Significance. If the empirical claims hold after detailed validation, the work would meaningfully advance responsible deployment of multimodal models in medicine by showing that explicit multi-signal agreement can reduce hallucinations and improve grounding without altering base model parameters. The structured output format and emphasis on uncertainty calibration are practical strengths for clinical trustworthiness. The observation that gains require signal integration (rather than individual auxiliaries) is a useful empirical insight, though it requires stronger supporting evidence to be fully convincing.

major comments (3)

[Experiments] Experiments section: The central claim that 'performance gains emerge only when these signals are integrated through contextual verification' is not supported by any ablation studies, comparisons to single-signal baselines, or quantitative metrics on signal agreement rates. The abstract reports specific deltas (AUC +0.007, hallucinated keywords 1.14→0.25) but supplies no statistical tests, error bars, or failure-mode analysis on cases of genuine evidence conflict, which directly undermines assessment of whether improvements are robust or artifacts of signal quality.
[Methods] Methods section: The contextual verification mechanism lacks any equations, algorithms, or pseudocode describing how radiomic statistics, explainability activations, and vocabulary-grounded semantic cues are combined, thresholded, or used to verify agreement. This is load-bearing for the claim of reliable integration, as the paper provides no error bounds, independence checks, or conflict-resolution rules, leaving open the possibility that correlated errors across signals could produce the observed reductions in hallucinations.
[Results/Experiments] §4 (or equivalent results section): No implementation details are given for the frozen VLM backbone, the exact radiomic features extracted, the explainability method used, or how structured outputs are generated and evaluated. Without these, the reported metrics cannot be reproduced or stress-tested against the skeptic's concern that auxiliary signals may systematically misalign on the same subsets used for AUC and hallucination evaluation.

minor comments (2)

[Abstract] The abstract refers to 'chest X-ray datasets' without naming them explicitly (e.g., whether MIMIC-CXR, NIH, or others); this should be clarified for reproducibility.
[Experiments] Cross-dataset evaluation on CheXpert is mentioned but not detailed with specific metrics or comparisons; a table or subsection would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and commit to revisions that strengthen the empirical support, formalization, and reproducibility of the work.

Referee: Experiments section: The central claim that 'performance gains emerge only when these signals are integrated through contextual verification' is not supported by any ablation studies, comparisons to single-signal baselines, or quantitative metrics on signal agreement rates. The abstract reports specific deltas (AUC +0.007, hallucinated keywords 1.14→0.25) but supplies no statistical tests, error bars, or failure-mode analysis on cases of genuine evidence conflict, which directly undermines assessment of whether improvements are robust or artifacts of signal quality.

Authors: We acknowledge that the submitted manuscript states the observation about auxiliary signals providing limited benefit without presenting the supporting ablation tables, agreement-rate metrics, statistical tests, or error bars. In the revision we will add a dedicated ablation subsection that reports performance for the no-auxiliary baseline, each individual signal, all pairwise combinations, and the full integrated setting. We will include signal-agreement rates, 5-run standard deviations as error bars, Wilcoxon signed-rank tests for the reported deltas, and a failure-mode analysis on the subset of cases where at least two signals disagree. These additions will directly substantiate the central claim.
revision: yes

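
The ablation grid promised in this response can be enumerated mechanically. The sketch below lists every subset of the three signals, from the no-auxiliary baseline to the full integrated setting, and includes a small helper for the mean and sample standard deviation over repeated runs. The signal labels and function names are assumptions for illustration, and no scores are supplied because the summary reports none per setting.

```python
from itertools import combinations
from statistics import mean, stdev

# Short labels for the three auxiliary signals (labels are illustrative).
SIGNALS = ("radiomic", "explainability", "semantic")


def ablation_settings(signals=SIGNALS):
    """Yield every subset of signals, from the empty baseline to the full set."""
    for k in range(len(signals) + 1):
        for subset in combinations(signals, k):
            yield subset


def summarize(run_scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(run_scores), stdev(run_scores)


for setting in ablation_settings():
    label = " + ".join(setting) if setting else "no auxiliary signals"
    print(label)
```

With three signals this yields eight settings: one baseline, three single-signal runs, three pairs, and the full combination. A paired test such as SciPy's `scipy.stats.wilcoxon` could then be applied to per-case scores from two settings; that step is left out here to keep the sketch free of third-party dependencies.
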
Referee: Methods section: The contextual verification mechanism lacks any equations, algorithms, or pseudocode describing how radiomic statistics, explainability activations, and vocabulary-grounded semantic cues are combined, thresholded, or used to verify agreement. This is load-bearing for the claim of reliable integration, as the paper provides no error bounds, independence checks, or conflict-resolution rules, leaving open the possibility that correlated errors across signals could produce the observed reductions in hallucinations.

Authors: We agree that a formal specification is required. The verification step computes per-signal normalized agreement scores (radiomic consistency via feature overlap, activation overlap via IoU on saliency maps, and semantic grounding via entity matching), then applies a weighted majority threshold with explicit conflict resolution that defers to the highest-reliability signal. In the revision we will insert the exact equations for each score, a complete pseudocode algorithm, bootstrap-derived error bounds on the agreement threshold, pairwise correlation coefficients among signals, and a short discussion of how the framework handles potential correlated errors.
revision: yes

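
The verification step described in this response can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0.5 agreement and majority thresholds, the near-tie `margin`, the weights, and the reliability values are all hypothetical, and treating a near-tie in the weighted vote as the trigger for deferring to the most reliable signal is only one possible reading of the conflict-resolution rule.

```python
from dataclasses import dataclass


def jaccard(a: set, b: set) -> float:
    """Set overlap, used here as a stand-in for radiomic feature overlap."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mask_iou(mask_a, mask_b) -> float:
    """IoU of two equally sized binary saliency masks (nested lists of 0/1)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for x, y in zip(row_a, row_b):
            inter += int(bool(x) and bool(y))
            union += int(bool(x) or bool(y))
    return inter / union if union else 1.0


def match_fraction(predicted: set, vocabulary: set) -> float:
    """Share of predicted entities found in the grounding vocabulary."""
    return len(predicted & vocabulary) / len(predicted) if predicted else 1.0


@dataclass
class Signal:
    name: str
    score: float        # normalized agreement score in [0, 1]
    weight: float       # assumed voting weight
    reliability: float  # assumed prior reliability, used only on near-ties


def verify(signals, agree_at=0.5, majority=0.5, margin=0.1):
    """Weighted majority vote; near-ties defer to the most reliable signal."""
    if not signals:
        raise ValueError("at least one signal is required")
    votes = {s.name: s.score >= agree_at for s in signals}
    total = sum(s.weight for s in signals)
    support = sum(s.weight for s in signals if votes[s.name]) / total
    if abs(support - majority) <= margin:
        best = max(signals, key=lambda s: s.reliability)
        return votes[best.name], support, f"deferred to {best.name}"
    return support > majority, support, "weighted majority"


# Toy inputs: feature names, masks, and entity labels are placeholders.
signals = [
    Signal("radiomic", jaccard({"f1", "f2", "f3"}, {"f2", "f3", "f4"}), 1.0, 0.6),
    Signal("activation", mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]), 1.0, 0.5),
    Signal("semantic", match_fraction({"t1", "t2", "t3"}, {"t1"}), 2.0, 0.9),
]
decision, support, reason = verify(signals)
print(decision, round(support, 2), reason)
```

In this toy case two signals vote to agree and the heavier semantic signal votes against, so the weighted support lands exactly on the majority threshold and the decision follows the semantic signal. The referee's concern about correlated errors still applies to a scheme like this: if all three scores fail together, the vote passes regardless.
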
Referee: §4 (or equivalent results section): No implementation details are given for the frozen VLM backbone, the exact radiomic features extracted, the explainability method used, or how structured outputs are generated and evaluated. Without these, the reported metrics cannot be reproduced or stress-tested against the skeptic's concern that auxiliary signals may systematically misalign on the same subsets used for AUC and hallucination evaluation.

Authors: We will expand the implementation subsection to specify the exact frozen VLM backbone and its checkpoint, the complete list of radiomic features extracted via PyRadiomics, the explainability technique (including layer and normalization details), and the precise evaluation pipeline for structured outputs (keyword hallucination detection via UMLS entity matching, AUC computation protocol, and how uncertainty and safety notes are scored). These details will enable independent reproduction and targeted stress-testing on potential misalignment subsets.
revision: yes
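
One simple way a hallucinated-keyword count could be computed is sketched below. It assumes the metric is a per-response count averaged over a test set, and it substitutes a plain Python set of single-word terms for UMLS entity matching, which is considerably richer. The function name, the term list, and the sample sentences are placeholders, not data from the paper.

```python
import re
from statistics import mean


def hallucinated_keywords(explanation, clinical_terms, grounded_terms):
    """Clinical terms that appear in the text but not in the grounded evidence."""
    tokens = set(re.findall(r"[a-z]+", explanation.lower()))
    mentioned = tokens & clinical_terms
    return sorted(mentioned - grounded_terms)


# Stand-in vocabulary; a real pipeline would use an ontology lookup instead.
CLINICAL_TERMS = {"effusion", "cardiomegaly", "pneumothorax"}

# (explanation text, terms supported by the evidence) pairs, all placeholders.
samples = [
    ("Pleural effusion with possible cardiomegaly.", {"effusion"}),
    ("Small effusion, no other finding.", {"effusion"}),
]
counts = [
    len(hallucinated_keywords(text, CLINICAL_TERMS, grounded))
    for text, grounded in samples
]
print(counts, mean(counts))
```

A token-level match like this ignores negation and multi-word entities, so a phrase such as "no pneumothorax" would still be counted; any reproduction would need the authors' exact matching rules.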

Circularity Check

0 steps flagged

No circularity: empirical results from dataset experiments

The paper introduces a context-aligned reasoning framework that augments a frozen VLM with auxiliary signals (radiomics, explainability, semantic cues) and enforces multi-evidence agreement before generating structured outputs. All reported gains—AUC improvement from 0.918 to 0.925, hallucinated keywords reduced from 1.14 to 0.25, explanation length from 19.4 to 15.3 words—are presented as direct measurements on chest X-ray and CheXpert datasets rather than quantities obtained by solving the paper's own equations or by renaming fitted parameters. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claim rests on external experimental outcomes and therefore remains self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes the framework at a high level without specifying any free parameters, mathematical axioms, or new invented entities.
