MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Lijie Hu; Lu Yin; Tianjin Huang; Xiangxiang Cui; Yifang Wang

arxiv: 2605.19027 · v2 · pith:RVLIERM3new · submitted 2026-05-18 · 💻 cs.CV


Pith reviewed 2026-05-22 09:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords: medical foundation models, robustness evaluation, vision-language models, medical image segmentation, clinical reliability, real-world variations, benchmarking

The pith

Medical foundation models need dedicated testing to hold up under real-world image variations before clinical use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create and apply a benchmark that measures how well medical foundation models cope with the kinds of image shifts and conditions encountered in actual healthcare settings. These models fall into two groups: vision-language models that handle tasks such as answering questions about scans or generating reports, and segmentation models that outline structures in images. A sympathetic reader would care because the models are already being positioned for broad medical use, yet any drop in reliability when data looks different from training examples could affect diagnosis or treatment decisions. The work therefore supplies a structured way to expose and compare those reliability gaps across both specialized and general-purpose models.

Core claim

The central claim is that widespread clinical deployment of medical foundation models requires rigorous evaluation of their reliability under real-world conditions, and that existing models in both the vision-language and segmentation categories must be tested against a new benchmark to reveal where performance breaks down.

What carries the argument

The MedFM-Robust benchmark, which applies controlled real-world variations to medical images and measures resulting drops in performance on tasks such as visual question answering, report generation, visual grounding, and segmentation.
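
As a rough illustration of the measurement described above, the sketch below computes a relative performance drop between a clean score and scores under perturbation. The function name `relative_drop`, the perturbation names, and all numbers are hypothetical placeholders, not values or definitions from the paper; the exact metric the benchmark uses is not given in the text reviewed here.

```python
from statistics import mean


def relative_drop(clean_score: float, perturbed_score: float) -> float:
    """Fraction of clean performance lost under a perturbation (0 means no loss)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - perturbed_score) / clean_score


# Hypothetical scores (e.g. VQA accuracy or Dice); NOT results from the paper.
clean = 0.80
perturbed = {"gaussian_noise": 0.60, "motion_blur": 0.72, "contrast_shift": 0.76}

drops = {name: relative_drop(clean, score) for name, score in perturbed.items()}
mean_drop = mean(drops.values())

# Report perturbations from most to least damaging.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {drop:.1%} relative drop")
print(f"mean relative drop: {mean_drop:.1%}")
```

Normalizing by the clean score makes drops comparable across tasks whose raw metrics live on different scales, which is one plausible reason a benchmark spanning VQA, captioning, grounding, and segmentation would report relative rather than absolute changes.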

If this is right

Developers would need to redesign training or add robustness techniques before models can be considered ready for hospitals.
Hospitals could use benchmark scores to decide which models to adopt for specific imaging tasks.
Model updates would be evaluated against the same variations to track whether robustness improves over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could become a standard reference point for any new medical AI system, even those not built on foundation models.
Similar robustness checks might be extended to other medical data types such as time-series signals or text reports.
If failures cluster around particular image variations, targeted data augmentation during training could be tested as a direct fix.

Load-bearing premise

That existing medical foundation models will exhibit clear performance drops when exposed to the kinds of image variations that occur outside controlled training conditions.

What would settle it

A set of tests in which every evaluated medical foundation model maintains its reported accuracy and segmentation quality when the input images are altered with the real-world variations defined in the benchmark.

Figures

Figures reproduced from arXiv: 2605.19027 by Lijie Hu, Lu Yin, Tianjin Huang, Xiangxiang Cui, Yifang Wang.

**Figure 2.** Overview of our robustness evaluation framework. We generate SSIM-calibrated perturbations across five severity levels, combining base corruptions with modality-specific artifacts. We benchmark three Med-VLMs and two SAM-based segmentation models under a unified protocol, and investigate multiple fine-tuning strategies across VQA, captioning, visual grounding, and segmentation tasks.

**Figure 3.** Comprehensive robustness evaluation of medical image segmentation models and VLMs under perturbations. Left (Segmentation): (a) Performance-robustness trade-off. (b) Strategy ranking. (c) Model comparison. (d) Dataset sensitivity. (e) Top 15 perturbation types. (f) Severity level impact. Right (VLMs): (g-i) Clean vs. perturbed performance on VQA, Grounding, and Captioning. (j-l) Perturbation impact.
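
The Figure 2 caption mentions SSIM-calibrated perturbations across five severity levels. One way such a calibration could work is sketched here, under stated assumptions: Gaussian noise stands in for a base corruption, the five SSIM targets are invented for illustration, and `global_ssim` is a simplified single-window SSIM rather than the locally windowed form of Wang et al. (2004). None of the names or values come from the paper.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))


def add_noise(img, sigma, seed=0):
    """Additive Gaussian noise with a fixed seed, so SSIM depends only on sigma."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def calibrate_sigma(img, target_ssim, lo=0.0, hi=2.0, steps=40):
    """Bisect on sigma until SSIM(img, noisy) is close to target_ssim."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if global_ssim(img, add_noise(img, mid)) > target_ssim:
            lo = mid  # still too similar to the clean image: add more noise
        else:
            hi = mid
    return (lo + hi) / 2


# Synthetic stand-in for a normalized grayscale scan (values in [0, 1]).
yy, xx = np.mgrid[0:64, 0:64] / 63.0
img = (xx + yy) / 2.0

# Hypothetical SSIM targets for severity levels 1..5 (assumed, not from the paper).
targets = [0.9, 0.8, 0.7, 0.6, 0.5]
sigmas = [calibrate_sigma(img, t) for t in targets]

for level, (t, s) in enumerate(zip(targets, sigmas), start=1):
    achieved = global_ssim(img, add_noise(img, s))
    print(f"severity {level}: target SSIM {t:.2f}, sigma {s:.4f}, achieved {achieved:.3f}")
```

Calibrating to a shared similarity target, rather than fixing one noise parameter for all images, is what would make a given severity level comparable across modalities with very different intensity statistics.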

Original abstract

Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models can be broadly categorized into two paradigms: Medical Vision-Language Models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma, to general-purpose models like GPT-4o and Gemini, all capable of medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, with adaptations like SAM-Med2D and MedSAM. The widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces MedFM-Robust as a benchmark for robustness of medical foundation models but stays mostly at the motivation stage without methods or results.

The main thing to know is that this paper proposes MedFM-Robust to evaluate how medical foundation models hold up under real-world conditions, covering both Med-VLMs for tasks like VQA and report generation and SAM-based segmentation models. It ties this directly to the risks of clinical deployment.

The motivation is straightforward and reasonable given the stakes for patient outcomes. The paper does a decent job listing concrete model examples and separating the two paradigms clearly. That framing gives a useful starting point for thinking about safety in medical AI.

The soft spots are clear. The text provides no description of the benchmark itself, no details on perturbations, datasets, or evaluation protocols, and no results or comparisons to prior robustness work. This leaves the claim that a new benchmark is needed somewhat open, since it is not yet shown how this one differs in practice from standard CV robustness tests applied to medical data. The argument follows logically from the listed models and tasks without obvious internal contradictions.

This paper is aimed at researchers working on medical AI deployment and evaluation. A reader interested in safety protocols might pick up ideas from the motivation, but anyone expecting a complete benchmark study with data or experiments will find it thin. I would send it for peer review because the topic is timely and the basic position holds up, even if the full version needs substantial additions on the actual benchmark and findings to be publishable.

Referee Report

2 major / 1 minor

Summary. The paper motivates the need for robustness evaluation of medical foundation models (MedFMs), which it divides into Medical Vision-Language Models (Med-VLMs such as LLaVA-Med, MedGemma, GPT-4o, Gemini) for tasks like VQA and report generation, and segmentation models (SAM adaptations such as SAM-Med2D and MedSAM). It states that widespread clinical deployment necessitates rigorous reliability testing under real-world conditions and positions MedFM-Robust as the benchmark to perform this evaluation.

Significance. A well-designed robustness benchmark for MedFMs could help surface failure modes that affect clinical safety and guide model improvement, given the high stakes of medical imaging applications. The motivation aligns with standard concerns in applied medical ML about distribution shift and deployment reliability.

major comments (2)

[Abstract] The central claim that 'the widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions' is presented as a direct inference from model capabilities, but the text provides no citations to documented robustness failures in Med-VLMs or SAM adaptations, nor any comparison showing why existing robustness benchmarks are inadequate. This leaves the necessity of a new benchmark (MedFM-Robust) unsupported by concrete evidence.
[Full text] No methods, datasets, perturbation types, evaluation protocols, or results are described. A benchmarking paper requires at minimum a description of the benchmark construction, the specific real-world variations tested (e.g., scanner differences, patient demographics, image quality degradations), and baseline model performance to allow assessment of whether the benchmark reveals meaningful robustness gaps.

minor comments (1)

[Abstract] The model categorization (Med-VLMs vs. segmentation models) is clearly stated but would benefit from a table listing representative models and their primary tasks for quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We have addressed the major comments by strengthening the motivation with additional citations and expanding the description of the benchmark in the revised version.

Referee: [Abstract] The central claim that 'the widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions' is presented as a direct inference from model capabilities, but the text provides no citations to documented robustness failures in Med-VLMs or SAM adaptations, nor any comparison showing why existing robustness benchmarks are inadequate. This leaves the necessity of a new benchmark (MedFM-Robust) unsupported by concrete evidence.

Authors: We agree that the motivation would be strengthened by explicit citations and comparisons. In the revised manuscript, we have added references to documented robustness failures, including studies on domain shifts in medical VLMs (e.g., performance degradation across different hospitals and imaging protocols) and segmentation models (e.g., SAM adaptations failing under scanner variations). We also include a direct comparison to existing benchmarks such as MedMNIST and natural-image robustness suites, clarifying the unique gaps MedFM-Robust fills for foundation models in clinical settings.

Revision: yes

Referee: [Full text] No methods, datasets, perturbation types, evaluation protocols, or results are described. A benchmarking paper requires at minimum a description of the benchmark construction, the specific real-world variations tested (e.g., scanner differences, patient demographics, image quality degradations), and baseline model performance to allow assessment of whether the benchmark reveals meaningful robustness gaps.

Authors: We acknowledge that the initial submission could have provided more explicit detail in the main text. The revised manuscript now includes a dedicated Benchmark Construction section describing the datasets (drawn from public sources such as MIMIC-CXR and segmentation collections like KiTS), perturbation types (both synthetic degradations and real-world factors including scanner differences, demographic shifts, and image quality issues), evaluation protocols (including relative performance drop metrics), and baseline results for models such as LLaVA-Med, MedGemma, MedSAM, and SAM-Med2D that demonstrate meaningful robustness gaps.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper introduces MedFM-Robust as a benchmark for robustness evaluation of medical foundation models (Med-VLMs and segmentation adaptations like SAM-Med2D). The abstract and motivation text contain no equations, derivations, fitted parameters, predictions, or load-bearing self-citations. The central claim—that clinical deployment necessitates rigorous real-world reliability evaluation—follows directly from the listed model categories and tasks without any reduction to self-definition, renamed empirical patterns, or imported uniqueness theorems. The derivation chain is absent; the work is a straightforward applied benchmark proposal that remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no free parameters, axioms, or invented entities; the text relies on general domain knowledge about clinical deployment risks.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 3 internal anchors

[1] Cheng, J., Ye, J., Deng, Z., Chen, J., Li, T.X., Wang, H., Su, Y., Huang, Z., Chen, J., Jiang, L., Sun, H., He, J., Zhang, S., Zhu, M., Qiao, Y.: Sam-med2d (2023), https://api.semanticscholar.org/CorpusID:261339487

[2] Gutman, D.A., Codella, N.C.F., Celebi, M.E., Helba, B., Marchetti, M.A., Mishra, N.K., Halpern, A.C.: Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). 2018 IEEE 15th International Symposium on Biomedical Imaging (...

[3] Hendrycks, D., Dietterich, T.G.: Benchmarking neural network robustness to common corruptions and perturbations. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net (2019), https://openreview.net/forum?id=HJz6tiCqYm

[4, 5] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net (2022), https://openreview.net/forum?id=nZeVKeeFYf9

[6] Hu, Y., Li, T.X., Lu, Q., Shao, W., He, J., Qiao, Y., Luo, P.: Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 22170–22183 (2024), https://api.semanticscholar.org/CorpusID:267657686

[7] Huang, X., Shen, L., Liu, J., Shang, F., Li, H., Huang, H., Yang, Y.: Towards a multimodal large language model with pixel-level insight for biomedicine. In: AAAI Conference on Artificial Intelligence (2024), https://api.semanticscholar.org/CorpusID:274655737

[8] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

[9] Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., de Lange, T., Johansen, D., Johansen, H.D.: Kvasir-seg: A segmented polyp dataset. In: Ro, Y.M., Cheng, W., Kim, J., Chu, W., Cui, P., Choi, J., Hu, M., Neve, W.D. (eds.) MultiMedia Modeling - 26th International Conference, MMM 2020, Daejeon, South Korea, January 5-8, 2020, Proceedings, Part II. Lectu...

[10] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.B.: Segment anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 3992–4003 (2023), https://api.semanticscholar.org/CorpusID:257952310

[11] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. ArXiv abs/2306.00890 (2023), https://api.semanticscholar.org/CorpusID:258999820

[12] Ma, J., He, Y., Li, F., Han, L.J., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15 (2023), https://api.semanticscholar.org/CorpusID:260431203

[13] Moor, M., Banerjee, O., Abad, Z.S.H., Krumholz, H.M., Leskovec, J., Topol, E.J., Rajpurkar, P.: Foundation models for generalist medical artificial intelligence. Nature (2023)

[14] OpenAI: Gpt-4v(ision) system card (2023)

[15] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

[16] Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.: Radiology objects in context (roco): A multimodal image dataset. In: CVII-STENT/LABELS@MICCAI (2018), https://api.semanticscholar.org/CorpusID:53087891

[17] Reinhold, J.C., Dewey, B.E., Carass, A., Prince, J.L.: Evaluating the impact of intensity normalization on mr image synthesis. Proceedings of SPIE–the International Society for Optical Engineering 10949 (2018), https://api.semanticscholar.org/CorpusID:54473002

[18] Sellergren, A., Kazemzadeh, S., Jaroensri, T., et al.: Medgemma technical report (2025), https://api.semanticscholar.org/CorpusID:280150648

[19] Suetens, P.: Fundamentals of medical imaging, 3rd edition (2017), https://api.semanticscholar.org/CorpusID:264848844

[20] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

[21] Tellez, D., Litjens, G.J.S., Bándi, P., Bulten, W., Bokhorst, J.M., Ciompi, F., van der Laak, J.: Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Medical image analysis 58, 101544 (2019), https://api.semanticscholar.org/CorpusID:62841444

[22] Thirunavukarasu, A., Ting, D., Elangovan, K., Gutierrez, L., Tan, T.F., Ting, D.: Large language models in medicine. Nature medicine (2023)

[23] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4566–4575 (2015)

[24] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 600–612 (2004), https://api.semanticscholar.org/CorpusID:207761262

[25] Zhao, X., Huang, W., Wang, X., Zhao, H., Zhuang, L., Jiang, A., Wan, G., Ye, M.: Divide, conquer and unite: Hierarchical style-recalibrated prototype alignment for federated medical segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 28760–28768 (2026)
