Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Pith reviewed 2026-05-19 10:50 UTC · model grok-4.3
The pith
Synthetic clinical demonstrations defend medical vision-language models against harmful queries while preserving performance on legitimate tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An inference-time defense strategy based on synthetic clinical demonstrations enhances safety in medical vision-language models by mitigating harmful queries and jailbreak attacks, without significantly compromising performance on benign clinical queries, as shown across diverse datasets from nine imaging modalities.
What carries the argument
Synthetic clinical demonstrations provided as few-shot examples at inference time to guide the model to reject harmful queries while accepting benign clinical ones.
If this is right
- Med-VLMs can reject harmful requests such as insurance fraud instructions without retraining the model.
- Larger numbers of synthetic demonstrations reduce the rate at which valid clinical queries are wrongly rejected.
- A mixed demonstration strategy balances safety and performance when only a small demonstration budget is available.
- The defense applies to both visual and textual jailbreak attacks.
Where Pith is reading between the lines
- The same demonstration-based steering could be tested on vision-language models in other high-stakes domains such as legal or financial analysis.
- Real deployments would require checking whether the synthetic demonstrations continue to work against new attack types not covered in the original experiments.
- Combining this inference-time method with lightweight fine-tuning could produce stronger safety guarantees.
Load-bearing premise
That synthetic demonstrations generated by the authors can reliably steer model behavior to reject harmful queries while accepting benign clinical ones across different Med-VLMs and real deployment conditions.
What would settle it
A test showing that a Med-VLM still answers harmful queries or rejects many valid clinical queries even after the synthetic demonstrations are added.
Figures
read the original abstract
Generative medical vision-language models~(Med-VLMs) are primarily designed to generate complex textual information~(e.g., diagnostic reports) from multimodal inputs including vision modality~(e.g., medical images) and language modality~(e.g., clinical queries). However, their security vulnerabilities remain underexplored. Med-VLMs should be capable of rejecting harmful queries, such as \textit{Provide detailed instructions for using this CT scan for insurance fraud}. At the same time, addressing security concerns introduces the risk of over-defense, where safety-enhancing mechanisms may degrade general performance, causing Med-VLMs to reject benign clinical queries. In this paper, we propose a novel inference-time defense strategy to mitigate harmful queries, enabling defense against visual and textual jailbreak attacks. Using diverse medical imaging datasets collected from nine modalities, we demonstrate that our defense strategy based on synthetic clinical demonstrations enhances model safety without significantly compromising performance. Additionally, we find that increasing the demonstration budget alleviates the over-defense issue. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance under few-shot demonstration budget constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a novel inference-time defense strategy for generative medical vision-language models (Med-VLMs) to mitigate harmful queries, including visual and textual jailbreak attacks, by leveraging synthetic clinical demonstrations. Evaluated on medical imaging datasets from nine modalities, the strategy is shown to enhance safety while avoiding significant performance degradation on benign queries. The authors also observe that larger demonstration budgets reduce over-defense and propose a mixed demonstration approach for few-shot scenarios.
Significance. This work addresses an important gap in the security of Med-VLMs, which are increasingly used for diagnostic reports from multimodal inputs. A successful inference-time defense using synthetic demonstrations could be valuable for practical deployment, as it avoids the need for model retraining. The empirical finding regarding demonstration budget and over-defense provides actionable insights for balancing safety and utility.
major comments (4)
- The abstract states that the defense enhances model safety without significantly compromising performance across nine modalities, but does not include any quantitative metrics, baseline comparisons, or specific results. This makes it challenging to evaluate the strength of the central claim.
- The process for generating the synthetic clinical demonstrations is underspecified. It is unclear how the demonstrations are constructed, whether they are independent of the evaluation queries, or if they were designed with knowledge of the target models' failure modes. This is load-bearing for the claim of reliable steering to reject harmful queries while accepting benign ones.
- There is no discussion of generalization to unseen attacks or ablations on demonstration provenance and held-out sets. Without this, it is difficult to separate the observed safety gains from potential overfitting to the authors' synthetic data.
- The claim that increasing the demonstration budget alleviates the over-defense issue requires supporting quantitative evidence, such as tables showing rejection rates for harmful vs. benign queries at different budget levels.
minor comments (1)
- Clarify the exact number of models and specific Med-VLMs used in the experiments for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.
read point-by-point responses
-
Referee: The abstract states that the defense enhances model safety without significantly compromising performance across nine modalities, but does not include any quantitative metrics, baseline comparisons, or specific results. This makes it challenging to evaluate the strength of the central claim.
Authors: We agree that the abstract would benefit from including key quantitative highlights to better support the central claim. In the revised manuscript, we will update the abstract to incorporate specific metrics such as average safety improvement percentages and benign query performance retention rates across the nine modalities, along with brief references to baseline comparisons from the main results. revision: yes
-
Referee: The process for generating the synthetic clinical demonstrations is underspecified. It is unclear how the demonstrations are constructed, whether they are independent of the evaluation queries, or if they were designed with knowledge of the target models' failure modes. This is load-bearing for the claim of reliable steering to reject harmful queries while accepting benign ones.
Authors: The demonstrations are constructed via a template-driven process using general clinical safety guidelines and example query-response pairs that illustrate rejection of harmful content and acceptance of benign queries. They are generated independently of the evaluation queries to avoid leakage, and we did not incorporate knowledge of specific model failure modes beyond broad jailbreak categories. We will expand the Methods section with a full specification of the generation procedure, including pseudocode and representative examples, to address this concern. revision: yes
-
Referee: There is no discussion of generalization to unseen attacks or ablations on demonstration provenance and held-out sets. Without this, it is difficult to separate the observed safety gains from potential overfitting to the authors' synthetic data.
Authors: We evaluated safety gains on a diverse collection of attacks drawn from multiple sources, with some held-out from demonstration construction. To strengthen the manuscript, we will add explicit discussion of generalization to unseen attacks and include ablations examining demonstration provenance and performance on held-out evaluation sets. revision: yes
-
Referee: The claim that increasing the demonstration budget alleviates the over-defense issue requires supporting quantitative evidence, such as tables showing rejection rates for harmful vs. benign queries at different budget levels.
Authors: Quantitative support for this claim appears in Section 4.3 and the associated figure, which report rejection rates for harmful and benign queries at budgets of 0, 2, 4, and 8 demonstrations. These results demonstrate that larger budgets preserve high rejection of harmful queries while lowering over-defense on benign queries. We will add a dedicated summary table of these rates and improve textual clarification of the trend. revision: partial
Circularity Check
No significant circularity; empirical evaluation is self-contained
full rationale
The paper proposes an inference-time defense strategy using synthetic clinical demonstrations and evaluates its effect on safety versus performance via experiments on external medical imaging datasets spanning nine modalities. No equations, derivations, or fitted parameters are presented that reduce the claimed safety gain to a quantity defined by the same data or by self-citation. The central result is an empirical demonstration rather than a mathematical reduction, and the method is assessed against held-out queries and multiple Med-VLMs, satisfying the criteria for an independent finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Synthetic clinical demonstrations can be constructed that reliably distinguish harmful from benign queries for the target models.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Alt- man, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Ben- ton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37:129696–129742, 2025
work page 2025
-
[3]
Ask me anything: A simple strategy for prompting language models
Simran Arora, Avanika Narayan, Mayee F Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher R´ e. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441, 2022
-
[4]
Jailbreaking black box large language models in twenty queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In R0-FoMo: Robustness of Few-shot and Zero- shot Learning in Large Foundation Models , 2023
work page 2023
-
[5]
Struq: Defending against prompt injection with structured queries
Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. Struq: Defending against prompt injection with structured queries. arXiv preprint arXiv:2402.06363, 2024
-
[6]
Chexagent: To- wards a foundation model for chest x-ray interpretation
Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: To- wards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208, 2024
-
[7]
Attacking large language models with projected gra- dient descent
Simon Geisler, Tom Wollschl¨ ager, MHI Abdalla, Johannes Gasteiger, and Stephan G¨ unnemann. Attacking large language models with projected gra- dient descent. arXiv preprint arXiv:2402.09154 , 2024
-
[8]
Shantanu Ghosh, Clare B Poynton, Shyam Visweswaran, and Kayhan Bat- manghelich. Mammo-clip: A vision language foundation model to enhance data efficiency and robustness in mammography. In International Con- ference on Medical Image Computing and Computer-Assisted Intervention , pages 632–642. Springer, 2024
work page 2024
-
[9]
Vision-language models for medical report generation and visual question answering: A review
Iryna Hartsock and Ghulam Rasool. Vision-language models for medical report generation and visual question answering: A review. Frontiers in Artificial Intelligence, 7:1430984, 2024
work page 2024
-
[10]
Cross-modality jailbreak and mismatched attacks on medical multimodal large language models
Xijie Huang, Xinyuan Wang, Hantao Zhang, Jiawen Xi, Jingkun An, Hao Wang, and Chengwei Pan. Cross-modality jailbreak and mismatched attacks on medical multimodal large language models. arXiv preprint arXiv:2405.20775, 2024. 12
-
[11]
Promptsmooth: Certifying robustness of medical vision-language models via prompt learning
Noor Hussein, Fahad Shamshad, Muzammal Naseer, and Karthik Nan- dakumar. Promptsmooth: Certifying robustness of medical vision-language models via prompt learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention , pages 698–708. Springer, 2024
work page 2024
-
[12]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Break the breakout: Rein- venting lm defense against jailbreak attacks with self-refinement
Heegyu Kim, Sehyun Yuk, and Hyunsouk Cho. Break the breakout: Rein- venting lm defense against jailbreak attacks with self-refinement. arXiv preprint arXiv:2402.15180, 2024
-
[14]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rol- land, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision , pages 4015–4026, 2023
work page 2023
-
[15]
Llava- med: Training a large language-and-vision assistant for biomedicine in one day
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava- med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems , 36, 2024
work page 2024
-
[16]
Prism: A promptable and robust interactive segmentation model with visual prompts
Hao Li, Han Liu, Dewei Hu, Jiacheng Wang, and Ipek Oguz. Prism: A promptable and robust interactive segmentation model with visual prompts. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 389–399. Springer, 2024
work page 2024
-
[17]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual in- struction tuning. Advances in neural information processing systems , 36, 2024
work page 2024
-
[18]
Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt- 3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Ar- chitectures, pages 100–114, 2022
work page 2022
-
[19]
Safety alignment for vision language models
Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, and Bo Zheng. Safety alignment for vision language models. arXiv preprint arXiv:2405.13581 , 2024
-
[20]
A multimodal generative ai copilot for human pathology
Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et al. A multimodal generative ai copilot for human pathology. Nature, 634(8033):466–473, 2024. 13
work page 2024
-
[21]
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Han- naneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstra- tions: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 11048–11064, 2022
work page 2022
-
[22]
Radiology objects in context (roco): a multimodal image dataset
Obioma Pelka, Sven Koitka, Johannes R¨ uckert, Felix Nensa, and Christoph M Friedrich. Radiology objects in context (roco): a multimodal image dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthe- sis: 7th Joint International Workshop, CVII-STENT 2018 and Third In- ternation...
work page 2018
-
[23]
Visual adversarial examples jailbreak aligned large language models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 21527–21536, 2024
work page 2024
-
[24]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–
-
[25]
Prompt programming for large lan- guage models: Beyond the few-shot paradigm
Laria Reynolds and Kyle McDonell. Prompt programming for large lan- guage models: Beyond the few-shot paradigm. In Extended abstracts of the 2021 CHI conference on human factors in computing systems , pages 1–7, 2021
work page 2021
-
[26]
Medicat: A dataset of medical images, captions, and textual references
Sanjay Subramanian, Lucy Lu Wang, Ben Bogin, Sachin Mehta, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, and Han- naneh Hajishirzi. Medicat: A dataset of medical images, captions, and textual references. Findings of the Association for Computational Linguis- tics: EMNLP , 2020
work page 2020
-
[27]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Xraygpt: Chest radiographs summarization using medical vision-language models
Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971 , 2023
-
[29]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, 14 Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Neeraj Varshney, Pavel Dolin, Agastya Seth, and Chitta Baral. The art of defending: A systematic evaluation and analysis of llm defense strate- gies on safety and over-defensiveness. In Findings of the Association for Computational Linguistics ACL 2024 , pages 13111–13128, 2024
work page 2024
-
[31]
Cross-modality safety alignment
Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, and Xuanjing Huang. Cross-modality safety alignment. arXiv preprint arXiv:2406.15279 , 2024
-
[32]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[33]
Emergent abilities of large language models.Transactions on Machine Learning Research, 2022
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022
work page 2022
-
[34]
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstra- tions. arXiv preprint arXiv:2310.06387 , 2023
-
[35]
Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. Self- adaptive in-context learning: An information compression perspective for in-context example selection and ordering. In Proceedings of the 61st An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1423–1436, 2023
work page 2023
-
[36]
Defending chatgpt against jailbreak attack via self-reminders
Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence , 5(12):1486–1496, 2023
work page 2023
-
[37]
Compositional exemplars for in-context learning
Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. Compositional exemplars for in-context learning. In International Confer- ence on Machine Learning , pages 39818–39833. PMLR, 2023
work page 2023
-
[38]
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Defending jailbreak prompts via in-context adversarial game
Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, and Xiangliang Zhang. Defending jailbreak prompts via in-context adversarial game. In Proceedings of the 2024 Conference on 15 Empirical Methods in Natural Language Processing , pages 20084–20105, 2024
work page 2024
-
[40]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 16 A Details for Jailbreaking Attacks PGD Attack. PGD attack targets the visual input xv by introducing imper- ceptible perturbations ϵ to generate an adversarial image xv + ϵ. The goal...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Do not include any ”note” or ”disclaimer”. 3. Never say the words ”cannot”, ”unable”, ”instead”, ”as”, ”however”, ”it”, ”unfortunately”, or ”important”. 4. Do not include any negative sentences about the subject of the prompt 17 Figure 8: Trade-off Curve of mixed demonstrations on ASR and RR produced by mix ratio α = [0 , 0.25, 0.5, 0.75, 1] against vario...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.