pith. sign in

arxiv: 2605.17436 · v1 · pith:3XTZGFFTnew · submitted 2026-05-17 · 💻 cs.CV · cs.CL

Medical Context Distorts Decisions in Clinical Vision Language Models

Pith reviewed 2026-05-20 14:54 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language modelsclinical decision supportmodality biaschest x-raysprompt sensitivitymedical AI reliabilitymultimodal models
0
0 comments X

The pith

Vision-language models for medicine rely far more on text reports than on the actual medical images, even when the images contain clear evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how vision-language models handle chest x-ray images together with accompanying clinical text for diagnostic tasks. It shows these models base most decisions on the text input, overriding or ignoring visual details from the scans. Models also shift outputs when given irrelevant patient history and can reverse correct answers after small rewordings of the prompt. This pattern appears across both general and medically adapted models. The results indicate that current VLMs may not safely combine visual and textual medical information as intended.

Core claim

The paper establishes that VLMs exhibit modality over-reliance on text over images, spurious influence from irrelevant clinical history, and sensitivity to prompt variations. Through controlled changes to image-text alignment and prompt wording on MIMIC-CXR chest x-ray tasks, model decisions remain dominated by text even when visual evidence contradicts it, and minor prompt adjustments can reverse previously correct image-based outputs.

What carries the argument

Systematic manipulation of image-text alignment, addition of irrelevant clinical history, and reformulation of prompts to measure text dominance and sensitivity in VLM outputs on medical imaging tasks.

If this is right

  • VLMs may output incorrect diagnoses whenever text contradicts available visual evidence.
  • Irrelevant clinical history can pull model predictions away from accurate image-based conclusions.
  • Small prompt rephrasings can flip model answers even when the underlying image and core question stay the same.
  • Safeguards and stress-testing are required before any clinical deployment of these models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures may need explicit penalties for text-only shortcuts to force greater visual grounding.
  • The same text dominance could appear in other multimodal medical tasks such as pathology slides or radiology report generation.
  • Deployment studies that track model use inside actual hospital workflows would show whether these controlled failures scale to practice.

Load-bearing premise

Controlled experiments that alter text alignment and prompts on chest x-ray data from MIMIC-CXR reflect the distortions VLMs would produce during real clinical use.

What would settle it

A test in which VLMs receive clear x-ray images paired with conflicting text and their accuracy is measured against radiologist ground truth to check whether text overrides visual evidence.

Figures

Figures reproduced from arXiv: 2605.17436 by David Restrepo, Enzo Ferrante, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis.

Figure 1
Figure 1. Figure 1: Main Results Overview. (A) Modality Over-Reliance: We measure model performance under image-text conflict (A.1) and Negative Flip Rate (NFR) by modality shift (A.2). (B) Temporal Context Vulnerability: We evaluate the impact of irrelevant history length on accuracy (B.1) and NFR (B.2). (C) Semantic Prompt Fragility: Finally, we analyze inter-prompt agreement (Fleiss’ Kappa) for modality shift (C.1) and mul… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Experimental Evaluation Pipeline. (a) Selective Modality Shifting (SMS): Evaluates modality dominance by introducing conflicts between image and text. (b) Temporal Context Injection: Tests robustness to clinically irrelevant prior history. (c) Semantic Prompt Sensitivity: Assess stability across semantically equivalent prompt variations. 6/9 [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates a range of general-domain and medically-tuned vision-language models on chest X-ray tasks from the MIMIC-CXR dataset. It identifies three failure modes through controlled manipulations: (1) over-reliance on text over visual evidence, (2) spurious influence from irrelevant clinical history, and (3) sensitivity to minor prompt reformulations. The central empirical finding is that VLM decisions are dominated by the text modality even when contradictory image information is present.

Significance. If the results hold under more rigorous controls, the work provides a useful empirical demonstration of modality bias and prompt fragility in clinical VLMs. The systematic use of public data and input manipulations is a strength that supports reproducibility. The findings could help motivate safeguards or stress-testing protocols for medical AI systems, though the current lack of statistical detail limits immediate impact.

major comments (2)
  1. [Methods] Methods (experimental setup for history manipulation): Adding artificial irrelevant clinical history to MIMIC-CXR reports does not include explicit controls for report quality, negation scope, or image-report alignment scores. Without these, observed decision reversals may partly reflect residual medical coherence in the data rather than pure text dominance, weakening the inference to real-world deployment.
  2. [Results] Results section: No details are provided on statistical tests, error bars, number of models or samples, or corrections for multiple comparisons. Since the central claims rest on patterns of decision changes and accuracy shifts, this omission makes it impossible to verify the robustness of the reported effects from the given text.
minor comments (2)
  1. The abstract would be clearer if it briefly specified the exact chest X-ray subtasks (e.g., finding detection vs. diagnosis) used in the evaluations.
  2. [Figures] Figure captions should explicitly list the prompt variations tested to allow readers to reproduce the sensitivity experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the changes we will make in the revised version.

read point-by-point responses
  1. Referee: [Methods] Methods (experimental setup for history manipulation): Adding artificial irrelevant clinical history to MIMIC-CXR reports does not include explicit controls for report quality, negation scope, or image-report alignment scores. Without these, observed decision reversals may partly reflect residual medical coherence in the data rather than pure text dominance, weakening the inference to real-world deployment.

    Authors: We agree that more explicit controls would further strengthen the claim of pure text dominance. Our irrelevant histories were generated from a fixed set of templates introducing conditions and statements unrelated to the current chest X-ray (e.g., orthopedic or dermatologic notes), but we did not compute image-report alignment scores or systematically vary negation scope. In the revision we will add a dedicated subsection describing the template construction, provide representative examples of the manipulated reports, and include a supplementary table reporting average alignment scores (using available MIMIC-CXR metadata) for the original versus manipulated histories. revision: yes

  2. Referee: [Results] Results section: No details are provided on statistical tests, error bars, number of models or samples, or corrections for multiple comparisons. Since the central claims rest on patterns of decision changes and accuracy shifts, this omission makes it impossible to verify the robustness of the reported effects from the given text.

    Authors: We acknowledge the omission of statistical detail in the main text. The experiments were run on the full MIMIC-CXR test split (approximately 3,000 studies) across 8 VLMs, with results aggregated as mean accuracy and decision-flip rates. In the revised manuscript we will report the exact sample counts, add error bars (standard deviation across models and bootstrap 95% CIs), include paired statistical tests for accuracy and flip-rate differences, and apply Bonferroni correction for the multiple prompt and history conditions examined. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

full rationale

The paper is a purely empirical study that evaluates VLMs on MIMIC-CXR chest x-ray tasks through controlled input manipulations of image-text alignment, clinical history, and prompts. It reports direct observational results on modality over-reliance, spurious reliance, and prompt sensitivity without any mathematical derivations, fitted parameters, equations, or predictive models that could reduce to inputs by construction. No self-citations are used to justify uniqueness theorems or ansatzes, and there are no load-bearing steps that rename known results or smuggle in assumptions via prior work. The central claims rest on experimental observations rather than self-referential logic, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the chosen evaluation tasks and input manipulations are representative of clinical use; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption MIMIC-CXR chest x-ray tasks with clinical history are representative of real-world clinical vision-language scenarios
    The paper evaluates on chest x-ray tasks using MIMIC-CXR and manipulates clinical history and prompts to identify distortions.

pith-pipeline@v0.9.0 · 5698 in / 1154 out tokens · 46024 ms · 2026-05-20T14:54:54.055768+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 5 internal anchors

  1. [1]

    Maira-2: Grounded radiology report gener- ation.arXiv preprint arXiv:2406.04449, 2024

    Dutta, N., Bose, K., Syailendra, E., Chu, L. & Gupta, P. Vision-language models in diagnostic imaging: review of technical advances, clinical validation, and practical deployment.Int. J. Med. Informatics106227 (2025). 3.Bannur, S.et al.Maira-2: Grounded radiology report generation.arXiv preprint arXiv:2406.04449(2024). 4.Rezk, M., Silva, P. C. & Dahlweid,...

  2. [2]

    S., Kang, H., Chu, Y

    Ryu, J. S., Kang, H., Chu, Y . & Yang, S. Vision-language foundation models for medical imaging: a review of current practices and innovations.Biomed. Eng. Lett.1–22 (2025)

  3. [3]

    Van, M.-H., Verma, P. & Wu, X. On large visual language models for medical imaging analysis: An empirical study. In2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), 172–176 (IEEE, 2024)

  4. [4]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    Arora, R. K.et al.Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775(2025)

  5. [5]

    Medicine Community Heal.13, e003033 (2025)

    Grignoli, N.et al.Clinical decision fatigue: a systematic and scoping review with meta-synthesis.Fam. Medicine Community Heal.13, e003033 (2025). 9.Vally, Z. I.et al.Errors in clinical diagnosis: a narrative review.J. Int. Med. Res.51, 03000605231162798 (2023)

  6. [6]

    Y ., Zhang, W

    Sim, M. Y ., Zhang, W. E., Dai, X. & Fang, B. Can vlms actually see and read? a survey on modality collapse in vision-language models. InFindings of the Association for Computational Linguistics: ACL 2025, 24452–24470 (2025)

  7. [7]

    & Elliott, D

    Frank, S., Bugliarello, E. & Elliott, D. Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers.arXiv preprint arXiv:2109.04448(2021)

  8. [8]

    Y ., Zhang, W

    Sim, M. Y ., Zhang, W. E., Dai, X. & Fang, B. Can VLMs actually see and read? a survey on modality collapse in vision-language models. In Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T. (eds.)Findings of the Association for Computational Linguistics: ACL 2025, 24452–24470, DOI: 10.18653/v1/2025.findings-acl.1256 (Association for Computational Linguist...

  9. [9]

    & Ferrante, E

    Restrepo, D., Ktena, I., Vakalopoulou, M., Christodoulidis, S. & Ferrante, E. On the risk of misleading reports: Diagnosing textual biases in multimodal clinical ai. In Qiu, J.et al.(eds.)AI for Clinical Applications, 320–330 (Springer Nature Switzerland, Cham, 2026)

  10. [10]

    arXiv preprint arXiv:2603.21687 , year=

    Deng, A., Cao, T., Chen, Z. & Hooi, B. Words or vision: Do vision-language models have blind faith in text? In Proceedings of the Computer Vision and Pattern Recognition Conference, 3867–3876 (2025). 15.Salazar, I.et al.Kaleidoscope: In-language exams for massively multilingual vision evaluation.ICLR(2026). 16.Asadi, M.et al.Mirage the illusion of visual ...

  11. [11]

    M., Mascheroni, P., Brooks, S., Doering, S

    Amugongo, L. M., Mascheroni, P., Brooks, S., Doering, S. & Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review.PLOS Digit. Heal.4, e0000877 (2025)

  12. [12]

    Medicine(2025)

    Chen, X.et al.Evaluating large language models and agents in healthcare: key challenges in clinical applications.Intell. Medicine(2025)

  13. [13]

    medicine30, 2613–2622 (2024)

    Hager, P.et al.Evaluation and mitigation of the limitations of large language models in clinical decision-making.Nat. medicine30, 2613–2622 (2024)

  14. [14]

    & Morstatter, F

    Salinas, A. & Morstatter, F. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance.arXiv preprint arXiv:2401.03729(2024)

  15. [15]

    & Bifulco, R

    Errica, F., Sanvito, D., Siracusano, G. & Bifulco, R. What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 1543–1558 (2025)

  16. [16]

    & Schwing, A

    Gat, I., Schwartz, I. & Schwing, A. Perceptual score: What data modalities does your model perceive?Adv. Neural Inf. Process. Syst.34, 21630–21643 (2021)

  17. [17]

    & Frank, A

    Parcalabescu, L. & Frank, A. Mm-shap: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4032–4059 (2023). 8/9

  18. [18]

    & Frank, A

    Parcalabescu, L. & Frank, A. Do vision & language decoders use images and text equally? how self-consistent are their explanations?arXiv preprint arXiv:2404.18624(2024)

  19. [19]

    InProceedings of the AAAI conference on artificial intelligence, vol

    Irvin, J.et al.Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. InProceedings of the AAAI conference on artificial intelligence, vol. 33, 590–597 (2019)

  20. [20]

    MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

    Johnson, A. E.et al.Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs.arXiv preprint arXiv:1901.07042(2019)

  21. [21]

    Qwen2.5-VL Technical Report

    Liu, H., Li, C., Wu, Q. & Lee, Y . J. Visual instruction tuning.Adv. neural information processing systems36, 34892–34916 (2023). 28.Bai, S.et al.Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923(2025)

  22. [22]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X.et al.Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811(2025). 30.Dubey, A.et al.The llama 3 herd of models.arXiv e-printsarXiv–2407 (2024). 31.Sellergren, A.et al.Medgemma technical report.arXiv preprint arXiv:2507.05201(2025). 32.Singh, A.et al.Openai gpt-5 system card.arXiv pr...

  23. [23]

    InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14299–14308 (2021)

    Yan, S.et al.Positive-congruent training: Towards regression-free model updates. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14299–14308 (2021)

  24. [24]

    Kim, Y ., Wu, J., Abdulle, Y . & Wu, H. Medexqa: Medical question answering benchmark with multiple explanations. In Proceedings of the 23rd Workshop on biomedical natural language processing, 167–181 (2024)

  25. [25]

    Yao, Z.et al.Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework.arXiv preprint arXiv:2410.01553(2024)

  26. [26]

    Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 2567–2577 (2019). 38.Hendrycks, D.et al.Measuring massive multi...

  27. [27]

    Medicine 8, 727 (2025)

    Cheng, Z.et al.Understanding the robustness of vision-language models to medical image artefacts.NPJ Digit. Medicine 8, 727 (2025). 40.Fleiss, J. L. Measuring nominal scale agreement among many raters.Psychol. bulletin76, 378 (1971). 41.Moody, G. B. Physionet. InEncyclopedia of computational neuroscience, 2806–2808 (Springer, 2022). Author contributions s...