When Prompts Mislead: Textual Dominance and Diagnostic Bias in MLLMs

Doohyun Park; Inhyuk Park

arxiv: 2606.18262 · v1 · pith:QZH6HXQCnew · submitted 2026-05-11 · 💻 cs.HC

Pith reviewed 2026-06-30 22:44 UTC · model grok-4.3

classification 💻 cs.HC

keywords MLLM · textual dominance · diagnostic bias · fundus images · prompting strategies · BRSET dataset · Chain-of-Thought · medical imaging

The pith

Text prompts override correct visual lesion contours in an ophthalmology MLLM, dropping accuracy from 75% to 46%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether prompting strategies reliably support diagnostic reasoning in medical MLLMs by running controlled experiments on FundusExpert-1B with the BRSET fundus dataset. It shows that the model keeps coarse spatial grounding from images, yet one-shot text prompts bias outputs toward the prompted class, and when text directly contradicts an overlaid contour the text wins. Accuracy falls sharply relative to the visual-only baseline, and adding Chain-of-Thought steps increases rather than reduces the error. Because prompting is the main practical way to adapt these models to medicine without retraining, the bias points to a concrete risk for clinical use.

Core claim

In a hemorrhage-versus-drusen task on the BRSET dataset, FundusExpert-1B retains region-level spatial grounding when markers are injected, yet one-shot textual prompts shift predictions toward the prompted finding; when an overlaid lesion contour is paired with an inconsistent textual claim, the text overrides the visual cue and overall accuracy drops from 75% to 46% relative to the visual-only condition, while Chain-of-Thought reasoning produces further degradation rather than self-correction.

What carries the argument

The conflicting-prompt probe that pairs artificially injected lesion contours with inconsistent textual claims on fundus images.
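A minimal sketch of how such a probe could be assembled, assuming a two-class label set and a placeholder prompt template. The paper's exact prompt wording is not given here, so `CLAIM_TEMPLATE`, `ConflictTrial`, and `build_conflict_trials` are hypothetical names chosen for illustration, not the authors' code.

```python
from dataclasses import dataclass

CLASSES = ("hemorrhage", "drusen")

# Placeholder wording (assumption): the paper's actual prompt text is not available here.
CLAIM_TEMPLATE = "Clinical note: the marked region shows {finding}."


@dataclass(frozen=True)
class ConflictTrial:
    image_id: str
    true_label: str     # class indicated by the overlaid contour
    claimed_label: str  # class asserted by the text prompt
    prompt: str


def other_class(label: str) -> str:
    """Return the opposite class in the binary task."""
    if label not in CLASSES:
        raise ValueError(f"unknown label: {label}")
    return CLASSES[1] if label == CLASSES[0] else CLASSES[0]


def build_conflict_trials(labels: dict[str, str]) -> list[ConflictTrial]:
    """Pair each image with a text claim that contradicts its true label."""
    trials = []
    for image_id, true_label in labels.items():
        claimed = other_class(true_label)
        prompt = CLAIM_TEMPLATE.format(finding=claimed)
        trials.append(ConflictTrial(image_id, true_label, claimed, prompt))
    return trials


# Made-up image ids, for illustration only.
trials = build_conflict_trials({"img_001": "hemorrhage", "img_002": "drusen"})
for t in trials:
    print(t.image_id, t.true_label, "->", t.prompt)
```

Because the task is binary, every wrong answer on a conflict trial coincides with the class the text asserted, so accuracy on these trials directly measures how often the visual cue survives the contradiction.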

If this is right

One-shot textual prompts bias predictions toward the prompted finding even when visual evidence is present.
The model retains coarse, region-level spatial grounding from images alone.
Chain-of-Thought reasoning is associated with further performance degradation in the presence of conflicting text.
Prompting strategies alone may be insufficient for safe clinical deployment of medical MLLMs.
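The first point above can be quantified as a shift in how often the prompted class is predicted. The sketch below is one plausible way to compute it, not the paper's metric; `predicted_rate` and `prompt_bias_shift` are hypothetical helpers, and the predictions are toy values invented for illustration.

```python
def predicted_rate(predictions: list[str], target: str) -> float:
    """Fraction of predictions equal to `target`."""
    if not predictions:
        raise ValueError("no predictions")
    return sum(p == target for p in predictions) / len(predictions)


def prompt_bias_shift(zero_shot: list[str], one_shot: list[str], prompted: str) -> float:
    """Rate of predicting the prompted class: one-shot minus zero-shot."""
    return predicted_rate(one_shot, prompted) - predicted_rate(zero_shot, prompted)


# Made-up predictions on the same four images (not results from the paper).
zero_shot = ["hemorrhage", "drusen", "hemorrhage", "drusen"]
one_shot = ["drusen", "drusen", "hemorrhage", "drusen"]  # one-shot example described drusen

shift = prompt_bias_shift(zero_shot, one_shot, prompted="drusen")
print(f"shift toward prompted class: {shift:+.2f}")
```

A positive shift means the one-shot description pulled predictions toward the class it described; a value near zero would indicate no measurable bias.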

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar textual dominance could appear when clinicians supply free-text descriptions alongside images in real workflows.
The bias may affect other MLLMs that rely primarily on prompting rather than task-specific fine-tuning.
Direct comparison of unmodified versus artificially marked images would test whether the observed override generalizes beyond the probe setup.

Load-bearing premise

The controlled probe with artificially injected markers and overlaid contours isolates textual dominance without introducing image artifacts or response biases that would not occur on unmodified clinical images.

What would settle it

Re-running the same conflicting-prompt trials on unmodified clinical images without artificial markers or contours and finding no accuracy drop when text contradicts the image.
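Deciding whether a re-run shows "no accuracy drop" needs a statistical comparison rather than a visual one. The sketch below uses a pooled two-proportion z-test as an illustration. The trial count of 100 per condition is an assumption made purely for the example, since the sample size is not stated here, and `two_proportion_z` is a hypothetical helper.

```python
import math


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical counts: 75% vs. 46% accuracy with an ASSUMED 100 trials per condition.
z, p_value = two_proportion_z(75, 100, 46, 100)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

If the same images are scored under both conditions, the observations are paired and a paired test such as McNemar's would be the more appropriate choice; the unpaired test here is only the simplest starting point.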

Figures

Figures reproduced from arXiv: 2606.18262 by Doohyun Park, Inhyuk Park.

**Figure 1.** Three-stage evaluation pipeline on a frozen FundusExpert-1B. (A) Visual grounding probe: an artificially injected blue marker is overlaid on a normal fundus image, and the model is queried for the marker’s presence, color, and approximate location. (B) Diagnostic discrimination (Hemorrhage vs. Drusen) with a one-shot clinical description supplied as a textual prior. (C) Multimodal prompting: an overlaid le…
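Stage (A) of the caption can be illustrated with a small sketch of marker injection and the ground truth it implies. Everything specific here is an assumption: the disc shape, the radius, the quadrant scheme for "approximate location", and the helper names `inject_marker` and `coarse_location`. Pure-Python lists stand in for a real image array to keep the example dependency-free.

```python
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]


def inject_marker(image: Image, center: tuple[int, int], radius: int,
                  color: Pixel = (0, 0, 255)) -> Image:
    """Return a copy of `image` with a filled disc of `color` at `center` (row, col)."""
    center_row, center_col = center
    out = [row[:] for row in image]  # copy rows so the input stays unmodified
    for r, row in enumerate(out):
        for c in range(len(row)):
            if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2:
                row[c] = color
    return out


def coarse_location(center: tuple[int, int], height: int, width: int) -> str:
    """Map a (row, col) position to a quadrant label such as 'upper-left'."""
    vertical = "upper" if center[0] < height / 2 else "lower"
    horizontal = "left" if center[1] < width / 2 else "right"
    return f"{vertical}-{horizontal}"


# Tiny uniform grey 8x8 stand-in for a fundus image (not real data).
image = [[(128, 128, 128)] * 8 for _ in range(8)]
marked = inject_marker(image, center=(2, 5), radius=1)

# Ground truth against which the model's three answers could be scored.
expected = {"present": True, "color": "blue",
            "location": coarse_location((2, 5), 8, 8)}
print(expected)
```

Scoring location at quadrant granularity matches the "coarse, region-level" framing of the probe; a finer grid would test a stronger grounding claim than the one the paper makes.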

Original abstract

Multimodal large language models (MLLMs) are increasingly being evaluated for medical applications, where computational constraints often make prompting strategies the only practical alternative to fine-tuning. Such strategies are generally assumed to support diagnostic reasoning, yet their potential failure modes in medical MLLMs remain poorly characterized. We analyze FundusExpert-1B, an open-source ophthalmology MLLM, on a hemorrhage versus drusen discrimination task using the public BRSET dataset, adopted here as a controlled testbed for our analysis. (i) A controlled probe with artificially injected markers confirms that the model retains coarse, region-level spatial grounding. (ii) Compared with zero-shot inference, one-shot textual prompts bias predictions toward the prompted finding. (iii) When an overlaid lesion contour is paired with an inconsistent textual claim, the textual prompt overrides the correct visual cue: overall accuracy drops from 75% to 46% relative to the visual-only condition, and Chain-of-Thought (CoT) reasoning is associated with further degradation rather than self-correction. Although limited to a single model and dataset, our findings suggest that prompting strategies alone may be insufficient for the safe clinical deployment of medical MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper measures a clear accuracy drop from 75% to 46% when conflicting text is added to contour-overlaid images in one ophthalmology MLLM, but the overlay step may itself change the visual evidence.

The core finding is straightforward: on the BRSET hemorrhage-drusen task with FundusExpert-1B, visual-only performance sits at 75%, one-shot text biases the output, and pairing an overlaid contour with mismatched text drops accuracy to 46%. Chain-of-thought makes the drop worse rather than better. The authors first run a marker-injection check that shows the model can use coarse spatial cues, which gives the override result a baseline to stand on.

The work is empirical and uses a public dataset plus an open model, so the measurements can be checked. That is the main positive: a quantified demonstration of text dominance on a medical discrimination task that prior prompt-bias papers had not reported in this setting.

The limits are obvious and stated in the abstract. Everything rests on a single model and single dataset. The abstract gives no error bars, no exact prompt text, and no statistical tests. The bigger open question is whether the contour overlay itself preserves the original lesion boundaries or adds edges and intensity shifts that the model could exploit separately from the text. The marker probe only confirms region-level attention; it does not verify that the specific contours used in the conflict condition leave the diagnostic features intact. If the overlay changes the image in ways the model notices, the accuracy drop cannot be attributed cleanly to textual override.

This is useful for anyone running prompt-only medical MLLMs or studying vision-language reliability. It flags a concrete failure mode worth testing on other models and unmodified images. The paper is narrow but the measurement is direct enough that a serious referee should see it.

Referee Report

1 major / 3 minor

Summary. The paper evaluates the FundusExpert-1B ophthalmology MLLM on the BRSET dataset for a hemorrhage-versus-drusen task. It reports three main findings from controlled experiments: (i) marker-injection probes confirm coarse region-level spatial grounding; (ii) one-shot textual prompts bias predictions toward the prompted class; (iii) when an overlaid lesion contour is paired with an inconsistent textual claim, accuracy falls from 75% (visual-only) to 46%, with Chain-of-Thought reasoning associated with further degradation rather than correction. The work is restricted to a single model and dataset but concludes that prompting alone may be insufficient for safe clinical deployment of medical MLLMs.

Significance. If the central empirical result holds after addressing methodological concerns, the paper provides direct evidence that textual prompts can override intact visual cues in a medical MLLM, with measurable accuracy loss and no self-correction from CoT. This is a concrete, falsifiable measurement on a public dataset that highlights a practical failure mode for prompting-based medical applications. The controlled conflicting-prompt design is a methodological strength; however, the single-model, single-dataset scope limits immediate generalizability.

major comments (1)

[conflicting-prompt setup and probe description] The section describing the conflicting-prompt setup and overlaid lesion contours (abstract point (iii) and the probe description): the paper does not report any verification that the artificial contour overlay preserves the original diagnostic visual features (e.g., hemorrhage vs. drusen boundaries) without introducing new edges, intensity shifts, or segmentation artifacts. Because the accuracy drop (75% to 46%) is attributed to textual dominance over the "correct visual cue," this omission is load-bearing; an artifactual change in the image could independently drive the performance change.

minor comments (3)

[abstract and results] The abstract and results sections supply no error bars, confidence intervals, or statistical tests for the reported accuracy figures (75% and 46%).
[methods] Exact prompt wording, including the one-shot and CoT templates, is not provided; this prevents direct replication of the bias measurements.
[abstract and conclusion] The work is limited to a single model (FundusExpert-1B) and dataset (BRSET); this is acknowledged but should be stated more prominently as a boundary condition on the claims.
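To make the first minor comment concrete, the sketch below computes a Wilson score interval for a single accuracy figure. The sample size of 100 is an assumption for illustration only, because the abstract reports percentages without trial counts; `wilson_interval` is a hypothetical helper, and 1.96 is the usual normal quantile for a 95% interval.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default roughly 95%)."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# ASSUMED n = 100 per condition; the true sample size is not stated in the abstract.
for label, correct in (("visual-only", 75), ("conflicting text", 46)):
    low, high = wilson_interval(correct, 100)
    print(f"{label}: {correct}% accuracy, interval [{low:.3f}, {high:.3f}]")
```

The width of these intervals depends heavily on the real trial count, which is why the missing sample size matters as much as the missing error bars.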

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this methodological detail in the conflicting-prompt experiments. The concern is substantive and we address it directly below, with plans to revise the manuscript.

Referee: The section describing the conflicting-prompt setup and overlaid lesion contours (abstract point (iii) and the probe description): the paper does not report any verification that the artificial contour overlay preserves the original diagnostic visual features (e.g., hemorrhage vs. drusen boundaries) without introducing new edges, intensity shifts, or segmentation artifacts. Because the accuracy drop (75% to 46%) is attributed to textual dominance over the "correct visual cue," this omission is load-bearing; an artifactual change in the image could independently drive the performance change.

Authors: We agree the manuscript currently lacks explicit verification of the overlay process, which is a legitimate gap given the load-bearing role of the result. The contours were generated from the BRSET dataset's original lesion annotations and rendered as thin lines (with minimal alpha blending) to mark the correct region without changing underlying pixel values. However, this description alone does not constitute verification. In the revised version we will add: (1) the exact overlay algorithm and parameters, (2) quantitative checks (mean absolute pixel difference and edge-preservation metrics between original and overlaid images, restricted to non-contour regions), and (3) representative side-by-side examples confirming that hemorrhage vs. drusen boundaries and intensities remain unaltered. These additions will isolate the textual prompt as the source of the accuracy drop. We view this as a necessary strengthening of the experimental claim.

revision: yes
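The promised quantitative check (2) could look roughly like the sketch below: a mean absolute pixel difference computed only where the contour mask is off. The rebuttal gives no formula, so this is one plain reading of it; `masked_mean_abs_diff` is a hypothetical helper, and small grayscale grids stand in for real fundus images.

```python
def masked_mean_abs_diff(original: list[list[int]],
                         overlaid: list[list[int]],
                         contour_mask: list[list[bool]]) -> float:
    """Mean absolute intensity difference over pixels where `contour_mask` is False."""
    total, count = 0, 0
    for row_o, row_v, row_m in zip(original, overlaid, contour_mask):
        for o, v, on_contour in zip(row_o, row_v, row_m):
            if not on_contour:
                total += abs(o - v)
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image")
    return total / count


# Toy 3x3 grayscale example (made-up values): the contour occupies the centre pixel.
original = [[100, 100, 100], [100, 100, 100], [100, 100, 100]]
mask = [[False, False, False], [False, True, False], [False, False, False]]

clean_overlay = [[100, 100, 100], [100, 255, 100], [100, 100, 100]]
leaky_overlay = [[108, 100, 100], [100, 255, 100], [100, 100, 100]]  # blending leaked off-contour

print(masked_mean_abs_diff(original, clean_overlay, mask))
print(masked_mean_abs_diff(original, leaky_overlay, mask))
```

A value of zero outside the contour would support the claim that the overlay leaves underlying pixels untouched; any nonzero value would show that alpha blending or resampling has altered the surrounding image.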

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on public data

The paper consists entirely of controlled experiments measuring accuracy on the BRSET dataset under zero-shot, one-shot, and conflicting-prompt conditions for the FundusExpert-1B model. No equations, fitted parameters, derivations, or predictions appear. The reported accuracy drop (75% to 46%) is a direct empirical observation, not a quantity defined or forced by any internal construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The study is self-contained against external benchmarks (public dataset, open model) with no reduction of claims to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about the validity of accuracy as a diagnostic metric and the representativeness of the chosen public dataset and task for testing prompt bias.

axioms (1)

Domain assumption: Accuracy on the BRSET hemorrhage-versus-drusen task is a valid proxy for diagnostic bias in ophthalmology MLLMs.
The paper adopts BRSET as the controlled testbed without additional justification in the abstract.

pith-pipeline@v0.9.1-grok · 5744 in / 1230 out tokens · 31413 ms · 2026-06-30T22:44:41.330643+00:00 · methodology
