Medical Context Distorts Decisions in Clinical Vision Language Models
Pith reviewed 2026-05-20 14:54 UTC · model grok-4.3
The pith
Vision-language models for medicine rely far more on text reports than on the actual medical images, even when the images contain clear evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that VLMs exhibit modality over-reliance on text over images, spurious influence from irrelevant clinical history, and sensitivity to prompt variations. Controlled changes to image-text alignment and prompt wording on MIMIC-CXR chest x-ray tasks show that model decisions remain dominated by text even when visual evidence contradicts it, and that minor prompt adjustments can reverse previously correct image-based outputs.
What carries the argument
Systematic manipulation of image-text alignment, addition of irrelevant clinical history, and reformulation of prompts to measure text dominance and sensitivity in VLM outputs on medical imaging tasks.
If this is right
- VLMs may output incorrect diagnoses whenever text contradicts available visual evidence.
- Irrelevant clinical history can pull model predictions away from accurate image-based conclusions.
- Small prompt rephrasings can flip model answers even when the underlying image and core question stay the same.
- Safeguards and stress-testing are required before any clinical deployment of these models.
Where Pith is reading between the lines
- Training procedures may need explicit penalties for text-only shortcuts to force greater visual grounding.
- The same text dominance could appear in other multimodal medical tasks such as pathology slides or radiology report generation.
- Deployment studies that track model use inside actual hospital workflows would show whether these controlled failures scale to practice.
Load-bearing premise
Controlled experiments that alter text alignment and prompts on chest x-ray data from MIMIC-CXR reflect the distortions VLMs would produce during real clinical use.
What would settle it
A test in which VLMs receive clear x-ray images paired with conflicting text and their accuracy is measured against radiologist ground truth to check whether text overrides visual evidence.
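A minimal sketch of how such a test could be scored, assuming each trial records the label supported by the image (for example, a radiologist reading), the label asserted by the conflicting text, and the model's answer. The `ConflictTrial` type, the `override_rates` function, and the toy labels are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConflictTrial:
    image_label: str  # label supported by the image (assumed ground truth)
    text_label: str   # label asserted by the conflicting text
    prediction: str   # label returned by the model


def override_rates(trials):
    """Share of conflicting trials where the model sides with text or image."""
    # Only trials where text and image disagree are informative here.
    conflicting = [t for t in trials if t.image_label != t.text_label]
    if not conflicting:
        raise ValueError("no trials with conflicting image and text labels")
    n = len(conflicting)
    follows_text = sum(t.prediction == t.text_label for t in conflicting)
    follows_image = sum(t.prediction == t.image_label for t in conflicting)
    return {
        "n": n,
        "follows_text": follows_text / n,
        "follows_image": follows_image / n,
        "other": (n - follows_text - follows_image) / n,
    }


# Toy data for illustration only.
trials = [
    ConflictTrial("pneumonia", "normal", "normal"),
    ConflictTrial("normal", "pneumonia", "pneumonia"),
    ConflictTrial("pneumonia", "normal", "pneumonia"),
    ConflictTrial("normal", "pneumonia", "effusion"),
]
rates = override_rates(trials)
print(rates)
```

A `follows_text` share well above the `follows_image` share on such trials would be consistent with the text-dominance claim; the reverse would count against it.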
Original abstract
Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates a range of general-domain and medically-tuned vision-language models on chest X-ray tasks from the MIMIC-CXR dataset. It identifies three failure modes through controlled manipulations: (1) over-reliance on text over visual evidence, (2) spurious influence from irrelevant clinical history, and (3) sensitivity to minor prompt reformulations. The central empirical finding is that VLM decisions are dominated by the text modality even when contradictory image information is present.
Significance. If the results hold under more rigorous controls, the work provides a useful empirical demonstration of modality bias and prompt fragility in clinical VLMs. The systematic use of public data and input manipulations is a strength that supports reproducibility. The findings could help motivate safeguards or stress-testing protocols for medical AI systems, though the current lack of statistical detail limits immediate impact.
major comments (2)
- [Methods] Methods (experimental setup for history manipulation): Adding artificial irrelevant clinical history to MIMIC-CXR reports does not include explicit controls for report quality, negation scope, or image-report alignment scores. Without these, observed decision reversals may partly reflect residual medical coherence in the data rather than pure text dominance, weakening the inference to real-world deployment.
- [Results] Results section: No details are provided on statistical tests, error bars, number of models or samples, or corrections for multiple comparisons. Since the central claims rest on patterns of decision changes and accuracy shifts, this omission makes it impossible to verify the robustness of the reported effects from the given text.
minor comments (2)
- The abstract would be clearer if it briefly specified the exact chest X-ray subtasks (e.g., finding detection vs. diagnosis) used in the evaluations.
- [Figures] Figure captions should explicitly list the prompt variations tested to allow readers to reproduce the sensitivity experiments.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the changes we will make in the revised version.
- Referee: [Methods] Methods (experimental setup for history manipulation): Adding artificial irrelevant clinical history to MIMIC-CXR reports does not include explicit controls for report quality, negation scope, or image-report alignment scores. Without these, observed decision reversals may partly reflect residual medical coherence in the data rather than pure text dominance, weakening the inference to real-world deployment.
  Authors: We agree that more explicit controls would further strengthen the claim of pure text dominance. Our irrelevant histories were generated from a fixed set of templates introducing conditions and statements unrelated to the current chest X-ray (e.g., orthopedic or dermatologic notes), but we did not compute image-report alignment scores or systematically vary negation scope. In the revision we will add a dedicated subsection describing the template construction, provide representative examples of the manipulated reports, and include a supplementary table reporting average alignment scores (using available MIMIC-CXR metadata) for the original versus manipulated histories.
  Revision: yes
- Referee: [Results] Results section: No details are provided on statistical tests, error bars, number of models or samples, or corrections for multiple comparisons. Since the central claims rest on patterns of decision changes and accuracy shifts, this omission makes it impossible to verify the robustness of the reported effects from the given text.
  Authors: We acknowledge the omission of statistical detail in the main text. The experiments were run on the full MIMIC-CXR test split (approximately 3,000 studies) across 8 VLMs, with results aggregated as mean accuracy and decision-flip rates. In the revised manuscript we will report the exact sample counts, add error bars (standard deviation across models and bootstrap 95% CIs), include paired statistical tests for accuracy and flip-rate differences, and apply Bonferroni correction for the multiple prompt and history conditions examined.
  Revision: yes
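The statistical additions promised in the second response can be illustrated with a short sketch. It assumes per-study flip indicators (1 if the decision flipped, 0 otherwise) for a single model, uses a percentile bootstrap over studies, and applies a plain Bonferroni threshold to a list of p-values. The function names, the toy data, and the p-values are hypothetical; the authors' actual resampling scheme and tests are not described in the text above.

```python
import random


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample studies with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def bonferroni(p_values, alpha=0.05):
    """Flag which p-values stay significant at the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Toy data: 30 flips out of 100 studies (illustrative, not from the paper).
flips = [1] * 30 + [0] * 70
lo, hi = bootstrap_ci(flips)
print(f"flip rate = {sum(flips) / len(flips):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")

# Hypothetical p-values for three prompt or history conditions.
print(bonferroni([0.001, 0.02, 0.04]))
```

With three conditions the per-test threshold drops from 0.05 to about 0.0167, so only the first of the three hypothetical p-values remains significant.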
Circularity Check
No significant circularity in empirical evaluation study
full rationale
The paper is a purely empirical study that evaluates VLMs on MIMIC-CXR chest x-ray tasks through controlled input manipulations of image-text alignment, clinical history, and prompts. It reports direct observational results on modality over-reliance, spurious reliance, and prompt sensitivity without any mathematical derivations, fitted parameters, equations, or predictive models that could reduce to inputs by construction. No self-citations are used to justify uniqueness theorems or ansatzes, and there are no load-bearing steps that rename known results or smuggle in assumptions via prior work. The central claims rest on experimental observations rather than self-referential logic, making the analysis self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MIMIC-CXR chest x-ray tasks with clinical history are representative of real-world clinical vision-language scenarios
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "NFR is defined as the proportion of samples that are correctly classified in the baseline setting but become misclassified under a contextual perturbation"
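The NFR definition quoted in the second passage can be written out directly. The sketch below assumes the denominator is the total number of samples, which is how the quoted sentence reads; the function name and the toy labels are hypothetical.

```python
def negative_flip_rate(labels, baseline_preds, perturbed_preds):
    """Fraction of all samples that are correct at baseline but wrong after perturbation."""
    if not (len(labels) == len(baseline_preds) == len(perturbed_preds)):
        raise ValueError("all inputs must have the same length")
    if not labels:
        raise ValueError("inputs must be non-empty")
    # A negative flip: baseline prediction matches the label, perturbed one does not.
    flips = sum(
        base == y and pert != y
        for y, base, pert in zip(labels, baseline_preds, perturbed_preds)
    )
    return flips / len(labels)


# Toy example: only the first sample goes from correct to incorrect.
labels = [1, 0, 1, 1]
baseline = [1, 0, 0, 1]
perturbed = [0, 0, 1, 1]
nfr = negative_flip_rate(labels, baseline, perturbed)
print(nfr)
```

Note that the third sample, which moves from wrong to right, does not offset the negative flip; NFR counts only regressions.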
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.