arxiv: 2603.18545 · v2 · submitted 2026-03-19 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Xiang Chen , Fangfang Yang , Chunlei Meng , Yuxian Dong , Ang Li , Yiwei Wei , Jiahuan Long , Jiujiang Guo

show 1 more author

Chengyin Hu

Authors on Pith no claims yet

Pith reviewed 2026-05-15 09:01 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords medical vision-language modelsdistribution shiftsrobustness evaluationimage pipeline attacksCLIP-style modelstoken-space adaptationimage quality auditing

0 comments

The pith

CoDA chains clinically plausible medical image shifts to degrade zero-shot accuracy in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoDA as a framework that builds realistic pipeline shifts by composing acquisition shading, reconstruction remapping, display operations, and delivery degradations. It jointly optimizes these stages under masked structural-similarity constraints so the resulting images stay visually plausible yet shift statistics enough to break model behavior. Experiments across brain MRI, chest X-ray, and abdominal CT show that chained compositions reduce zero-shot performance of CLIP-style medical vision-language models more than any isolated stage. The work also tests multimodal large language models as auditors of image realism and finds both proprietary and medical-specific models exhibit persistent high-confidence errors on the shifted samples. A post-hoc repair method using teacher-guided token-space adaptation with patch-level alignment is shown to recover accuracy on the affected images.

Core claim

CoDA jointly optimizes compositions of acquisition-like shading, reconstruction and display remapping, and delivery degradations under masked structural-similarity constraints to induce failures in medical vision-language models while preserving clinical readability; chained compositions degrade zero-shot performance more than any single stage, and a lightweight token-space repair recovers accuracy on the shifted outputs.

What carries the argument

The chain-of-distribution framework that composes and jointly optimizes pipeline stages under masked structural-similarity constraints to create clinically plausible shifts.

Load-bearing premise

The assumption that jointly optimized stage compositions under masked structural-similarity constraints produce shifts that remain visually plausible and clinically readable while still shifting image statistics enough to induce model failures.

What would settle it

An experiment in which no composition of the described acquisition, reconstruction, display, and delivery operations can simultaneously preserve clinical readability and substantially reduce zero-shot MVLM accuracy on brain MRI, chest X-ray, and abdominal CT would falsify the central effectiveness claim.

Figures

Figures reproduced from arXiv: 2603.18545 by Ang Li, Chengyin Hu, Chunlei Meng, Fangfang Yang, Jiahuan Long, Jiujiang Guo, Xiang Chen, Yiwei Wei, Yuxian Dong.

**Figure 2.** Figure 2: Qualitative visualization of CoDA composition families. Representative clean and adversarial samples for MRI, X-ray, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Auditing performance under clean vs. CoDA shifts. We compare proprietary and medical-specific MLLMs across MRI, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Post-hoc token-space repair. (a) A frozen teacher optionally guides a lightweight student token adapter trained on [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation of optimization iterations. Attack success [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation of repair hyperparameters. Robust accu [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Medical vision--language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoDA chains realistic clinical pipeline steps to degrade medical VLMs more than single corruptions and adds a token-space repair, but the images' diagnostic plausibility rests on unvalidated SSIM constraints.

read the letter

The main takeaway is that this work builds attacks by composing acquisition shading, reconstruction remapping, and delivery degradations under masked SSIM constraints, then shows these chains hurt zero-shot CLIP-style MVLM performance across brain MRI, chest X-ray, and abdominal CT more than any isolated stage. It also tests MLLMs as image-quality auditors and offers a teacher-guided token adaptation fix for archived outputs.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CoDA, a chain-of-distribution attack framework that jointly optimizes compositions of acquisition shading, reconstruction remapping, and delivery degradations under masked structural-similarity constraints to generate clinically plausible shifts in medical images. It claims these chained attacks substantially degrade zero-shot performance of CLIP-style medical vision-language models on brain MRI, chest X-ray, and abdominal CT datasets, with chained compositions more damaging than isolated stages. The work further evaluates multimodal LLMs as auditors of imaging realism, finds deficiencies in both proprietary and medical-specific MLLMs, and introduces a post-hoc teacher-guided token-space adaptation repair with patch-level alignment that improves accuracy on CoDA-shifted samples.

Significance. If the quantitative results and clinical validation hold, the work would meaningfully extend robustness evaluation of MVLMs beyond isolated corruptions by characterizing a realistic pipeline-based threat surface. The finding that chained stages are consistently more damaging, combined with the lightweight repair method, could inform deployment practices in radiology multimodal systems. The use of MLLMs for authenticity auditing is a novel angle, though its reliability findings would benefit from stronger baselines.

major comments (2)

[Abstract and §4] Abstract and §4 (Results): The central claims of substantial degradation and repair gains are stated without any reported numerical metrics, baselines, error bars, statistical tests, or dataset sizes in the abstract and appear underspecified in the results; this prevents assessment of effect sizes and reproducibility of the chained-vs-single-stage finding.
[§3.1] §3.1 (CoDA Framework): The claim that jointly optimized compositions remain visually plausible and clinically readable rests solely on masked SSIM constraints without expert radiologist ratings, comparison to real acquisition/reconstruction artifacts, or diagnostic-feature preservation metrics; if non-clinical artifacts are introduced, the 'clinically grounded threat surface' does not follow.

minor comments (2)

[§2] §2 (Related Work): The discussion of prior robustness studies could more explicitly contrast CoDA against existing medical-image corruption benchmarks to clarify novelty.
[Figure 1 and §3.2] Figure 1 and §3.2: The pipeline diagram would benefit from explicit parameter ranges for each stage to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have carefully considered each major comment and outline our responses below. We agree that additional quantitative details and validation steps will strengthen the manuscript and will incorporate revisions accordingly.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Results): The central claims of substantial degradation and repair gains are stated without any reported numerical metrics, baselines, error bars, statistical tests, or dataset sizes in the abstract and appear underspecified in the results; this prevents assessment of effect sizes and reproducibility of the chained-vs-single-stage finding.

Authors: We agree that the abstract and §4 would benefit from explicit numerical reporting to enable assessment of effect sizes and reproducibility. In the revised version, we will update the abstract to report specific accuracy drops (e.g., percentage degradation on brain MRI, chest X-ray, and abdominal CT), baseline comparisons against isolated stages, and repair gains. In §4 we will add error bars across runs, statistical tests (paired t-tests with p-values), and clearly restate dataset sizes and splits. These additions will directly support the chained-vs-single-stage finding. revision: yes
Referee: [§3.1] §3.1 (CoDA Framework): The claim that jointly optimized compositions remain visually plausible and clinically readable rests solely on masked SSIM constraints without expert radiologist ratings, comparison to real acquisition/reconstruction artifacts, or diagnostic-feature preservation metrics; if non-clinical artifacts are introduced, the 'clinically grounded threat surface' does not follow.

Authors: We acknowledge that masked SSIM provides only a computational proxy and does not replace clinical validation. While SSIM is a standard structural-preservation metric, we agree the claim would be stronger with additional evidence. In revision we will add quantitative comparisons of CoDA outputs against real acquisition/reconstruction artifacts drawn from public datasets, include diagnostic-feature metrics such as contrast-to-noise ratio in clinically relevant regions, and explicitly note the absence of expert radiologist ratings as a limitation. Qualitative examples will be expanded in the supplement. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential reductions

full rationale

The paper describes an empirical pipeline for constructing and evaluating distribution shifts via CoDA, relying on optimization under SSIM constraints followed by direct performance measurements on held-out medical imaging datasets. No equations, first-principles derivations, or predictions are claimed; results are obtained by running the described attack and repair procedures on real data. No self-citations are invoked as load-bearing uniqueness theorems, and no fitted parameters are relabeled as independent predictions. The central claims rest on experimental outcomes rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level framework name; optimization of stage parameters is implied but not detailed.

pith-pipeline@v0.9.0 · 5605 in / 1077 out tokens · 51616 ms · 2026-05-15T09:01:35.064353+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CoDA jointly optimizes stage compositions and within-stage parameters under masked structural-similarity constraints to induce failures while preserving visual plausibility
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

chained compositions are consistently more damaging than any single stage

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

[1]

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Frame- work. InProceedings of the 25th ACM SIGKDD International Conference on Knowl- edge Discovery & Data Mining. 2623–2631

work page 2019
[2]

Mohammad Amiriebrahimabadi, Zhina Rouhi, and Najme Mansouri. 2024. A Comprehensive Survey of Multi-Level Thresholding Segmentation Methods for Image Processing.Archives of Computational Methods in Engineering31 (2024), 3647–3697

work page 2024
[3]

Kaplan, Pouya Bashivan, and Irina Rish

Rishika Bhagwatkar, Perampalli Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z. Kaplan, Pouya Bashivan, and Irina Rish. 2024. Towards Adversar- ially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques. InProceedings of the 41st International Conference on Machine Learning

work page 2024
[4]

Rishika Bhagwatkar, Shravan Nayak, Pouya Bashivan, and Irina Rish. 2024. Improving Adversarial Robustness in Vision-Language Models with Architecture and Prompt Design. InFindings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics, 17003–17020

work page 2024
[5]

Nour El Houda Bourai, Hayet Farida Merouani, and Akila Djebbar. 2024. Deep Learning-Assisted Medical Image Compression Challenges and Opportunities: Systematic Review.Neural Computing and Applications36 (2024), 10067–10108

work page 2024
[6]

Wagner, David A

Zijie Cheng, Ariel Yuhan Ong, Siegfried K. Wagner, David A. Merle, Lie Ju, Hanyuan Zhang, Ruinian Chen, Linze Pang, Boxuan Li, Tiantian He, Anran Ran, Hongyang Jiang, Dawei Gabriel Yang, Ke Zou, et al. 2025. Understanding the Robustness of Vision-Language Models to Medical Image Artefacts.npj Digital Medicine8 (2025), 727

work page 2025
[7]

Wonjeong Choi, Sejong Ryu, Jungmoon Lee, Dong-Jun Han, and Jaekyun Moon

work page
[8]

InInternational Conference on Learning Representa- tions

Identifying Robust Neural Pathways: Few-Shot Adversarial Mask Tuning for Vision-Language Models. InInternational Conference on Learning Representa- tions

work page
[9]

Wiest, Carolin V

Jan Clusmann, Dyke Ferber, Isabella C. Wiest, Carolin V. Schneider, Titus J. Brinker, Sebastian Foersch, Daniel Truhn, and Jakob Nikolas Kather. 2025. Prompt Injection Attacks on Vision Language Models in Oncology.Nature Communica- tions16 (2025), 1239

work page 2025
[10]

DeGrave, Joseph D

Alex J. DeGrave, Joseph D. Janizek, and Su-In Lee. 2021. AI for Radiographic COVID-19 Detection Selects Shortcuts over Signal.Nature Machine Intelligence (2021)

work page 2021
[11]

Gemini Team. 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Google DeepMind. 2024. Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context.arXiv preprint arXiv:2403.05530(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Jianping Gou, Baosheng Yu, Stephen John Maybank, and Dacheng Tao. 2021. Knowledge Distillation: A Survey.International Journal of Computer Vision129, 6 (2021), 1789–1819

work page 2021
[14]

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2024. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. InProceedings of the IEEE/CVF Conference on Computer Vis...

work page 2024
[15]

Haoyu He, Jianfei Cai, Jing Zhang, Dacheng Tao, and Bohan Zhuang. 2023. Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision

work page 2023
[16]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations

work page 2022
[17]

Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, and Zsolt Kira

work page
[18]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3909–3918

work page
[19]

Xijie Huang, Xinyuan Wang, Hantao Zhang, Yinghao Zhu, Jiawen Xi, Jingkun An, Hao Wang, Hao Liang, and Chengwei Pan. 2025. Medical MLLM Is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 3797–3805

work page 2025
[20]

Ruinan Jin, Chun-Yin Huang, Chenyu You, and Xiaoxiao Li. 2024. Backdoor Attack on Unpaired Medical Image-Text Foundation Models: A Pilot Study on MedCLIP. InProceedings of the IEEE Conference on Secure and Trustworthy Ma- chine Learning. 272–285

work page 2024
[21]

Muhammad Uzair Khattak, Shahina Kunhimon, Muzammal Naseer, Salman Khan, and Fahad Shahbaz Khan. 2024. UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities.arXiv preprint arXiv:2412.10372(2024)

work page arXiv 2024
[22]

LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, and Yu Rong. 2025. Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning.arXiv preprint...

work page internal anchor Pith review arXiv 2025
[23]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

work page 2023
[24]

Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. 2022. Scaling and Shifting Your Features: A New Baseline for Efficient Model Tuning. InAdvances in Neural Information Processing Systems 35 (NeurIPS 2022)

work page 2022
[25]

Chaohu Liu, Gui Tianyi, Yu Liu, and Linli Xu. 2026. AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Opti- mization. InInternational Conference on Learning Representations

work page 2026
[26]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual In- struction Tuning. InAdvances in Neural Information Processing Systems

work page 2023
[27]

Alejandro Lozano, Min Woo Sun, Ivan Lopez, Jeffrey Gu, Jeffrey J. Nirschl, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Yuhui Zhang, James Burgess, Liangyu Chen, Collin Chiu, Xiaohan Wang, Alfred Seunghoon Song, Robert Tibshirani, and Serena Yeung-Levy. 2025. BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Der...

work page 2025
[28]

Parikh, Jonathan R

Zhixiu Lu, Hailong Li, Nehal A. Parikh, Jonathan R. Dillman, and Lili He. 2025. RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language- Image Pre-training.IEEE Transactions on Neural Networks and Learning Systems (2025). doi:10.1109/TNNLS.2025.3568036

work page doi:10.1109/tnnls.2025.3568036 2025
[29]

Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, and Anurag Arnab

work page
[30]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Time-, Memory- and Parameter-Efficient Visual Adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page
[31]

National Electrical Manufacturers Association. 2024. Digital Imaging and Com- munications in Medicine (DICOM) Standard, PS3.3: Information Object Defini- tions. Standard

work page 2024
[32]

Fahimeh Hosseini Noohdani, Parsa Hosseini, Aryan Yazdan Parast, Hamidreza Yaghoubi Araghi, and Mahdieh Soleymani Baghshah. 2024. Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 27662–27671

work page 2024
[33]

OpenAI. 2023. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. InProceedings of the 38th International Con- ference on Machine Learning (Proceedings of Machi...

work page 2021
[35]

Rudie, Hui-Ming Lin, Robyn L

Jeffrey D. Rudie, Hui-Ming Lin, Robyn L. Ball, Sabeena Jalal, Luciano M. Prevedello, Savvas Nicolaou, Brett S. Marinelli, Adam E. Flanders, Kirti Magudia, George Shih, Melissa A. Davis, John Mongan, Peter D. Chang, Ferco H. Berger, Se- bastiaan Hermans, Meng Law, Tyler Richards, Jan-Peter Grunz, Andreas Steven Kunz, Shobhit Mathur, Sandro Galea-Soler, And...

work page 2024
[36]

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein

work page
[37]

InProceedings of the 41st Interna- tional Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models. InProceedings of the 41st Interna- tional Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). 43685–43704

work page
[38]

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Ste- fanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lin Yang, et al. 2025. MedGemma Technical Report.arXiv p...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, and Mario Fritz. 2025. Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 6842–6850

work page 2025
[40]

Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. 2025. How Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMs. InComputer Vision – ECCV 2024 (Lecture Notes in Computer Science, Vol. 15109). 37–55

work page 2025
[41]

Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, and Curtis Langlotz. 2024. RaVL: Discovering and Mitigating Spurious Correla- tions in Fine-Tuned Vision-Language Models. InAdvances in Neural Information Processing Systems

work page 2024
[42]

Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves 9 Xiang Chen et al. Chain of Thought Reasoning in Language Models. InInternational Conference on Learning Representations

work page 2023
[43]

Bovik, Hamid R

Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image Quality Assessment: From Error Visibility to Structural Similarity.IEEE Transactions on Image Processing13, 4 (2004), 600–612

work page 2004
[44]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems

work page 2022
[45]

Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, and Kani Chen. 2025. Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14679–14689

work page 2025
[46]

Zech, Marcus A

John R. Zech, Marcus A. Badgeley, Ming Liu, Adam B. Costa, Joseph J. Titano, and Eric K. Oermann. 2018. Variable Generalization Performance of a Deep Learning Model to Detect Pneumonia in Chest Radiographs: A Cross-sectional Study.PLOS Medicine15, 11 (2018), e1002683

work page 2018
[47]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Angela Crabtree, Brian Piening, Carlo Bifulco, Matthew P. Lungren, Tristan Naumann, Sheng Wang, and Hoifung Poon. 2023. BiomedCLIP: A Mul...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Manning, and Curtis P

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. 2022. Contrastive Learning of Medical Visual Representations from Paired Images and Text. InProceedings of the Machine Learning for Healthcare Conference (Proceedings of Machine Learning Research, Vol. 182). 1–24

work page 2022
[49]

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. 2023. On Evaluating Adversarial Robustness of Large Vision-Language Models. InAdvances in Neural Information Processing Systems

work page 2023
[50]

Hong-Yu Zhou, Xiaoyu Chen, Yinghao Zhang, Ruibang Luo, Liansheng Wang, and Yizhou Yu. 2022. Generalized Radiograph Representation Learning via Cross- supervision between Images and Free-text Radiology Reports.Nature Machine Intelligence(2022). 10

work page 2022