MedFM-Robust: Benchmarking Robustness of Medical Foundation Models
Pith reviewed 2026-05-22 09:16 UTC · model grok-4.3
The pith
Medical foundation models need dedicated testing to hold up under real-world image variations before clinical use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that widespread clinical deployment of medical foundation models requires rigorous evaluation of their reliability under real-world conditions, and that existing models in both the vision-language and segmentation categories must be tested against a new benchmark to reveal where performance breaks down.
What carries the argument
The MedFM-Robust benchmark, which applies controlled real-world variations to medical images and measures resulting drops in performance on tasks such as visual question answering, report generation, visual grounding, and segmentation.
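To make "resulting drops in performance" concrete, here is a minimal sketch of one common way to score such a drop: the fraction of clean-image performance lost under each variation. The abstract does not define the benchmark's metric, so the function names, the formula, and the scores below are illustrative assumptions rather than the paper's method.

```python
def relative_drop(clean_score: float, perturbed_score: float) -> float:
    """Fraction of clean performance lost under one variation (assumed metric)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - perturbed_score) / clean_score


def robustness_report(clean_score: float, perturbed_scores: dict) -> tuple:
    """Return per-variation relative drops and their unweighted mean."""
    drops = {
        name: relative_drop(clean_score, score)
        for name, score in perturbed_scores.items()
    }
    mean_drop = sum(drops.values()) / len(drops)
    return drops, mean_drop


# Hypothetical scores for illustration only; these are not results from the paper.
scores = {"gaussian_noise": 0.60, "blur": 0.72}
drops, mean_drop = robustness_report(0.80, scores)
```

A relative rather than absolute drop lets tasks with different score ranges (for example VQA accuracy and a segmentation overlap score) be compared on the same scale, which is one reason a benchmark spanning several tasks might prefer it.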
If this is right
- Developers would need to redesign training or add robustness techniques before models can be considered ready for hospitals.
- Hospitals could use benchmark scores to decide which models to adopt for specific imaging tasks.
- Model updates would be evaluated against the same variations to track whether robustness improves over time.
Where Pith is reading between the lines
- The benchmark could become a standard reference point for any new medical AI system, even those not built on foundation models.
- Similar robustness checks might be extended to other medical data types such as time-series signals or text reports.
- If failures cluster around particular image variations, targeted data augmentation during training could be tested as a direct fix.
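The last point above, targeted data augmentation, can be sketched in a few lines. This is a hedged illustration under stated assumptions: images are plain nested lists of intensities in [0, 1], and the two variations (additive Gaussian noise and a brightness shift) are stand-ins chosen for simplicity, not the variations defined by MedFM-Robust. All function names are hypothetical.

```python
import random


def _clip(value: float) -> float:
    """Keep an intensity inside the assumed [0, 1] range."""
    return min(1.0, max(0.0, value))


def add_gaussian_noise(image, sigma: float, seed: int = 0):
    """Add zero-mean Gaussian noise to every pixel, then clip."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return [[_clip(px + rng.gauss(0.0, sigma)) for px in row] for row in image]


def shift_brightness(image, delta: float):
    """Add a constant offset to every pixel, then clip."""
    return [[_clip(px + delta) for px in row] for row in image]


def augment(image, sigma: float = 0.05, delta: float = 0.1, seed: int = 0):
    """Apply noise followed by a brightness shift (illustrative pipeline)."""
    return shift_brightness(add_gaussian_noise(image, sigma, seed), delta)


# Toy 2x2 "image" for illustration only.
image = [[0.0, 0.5], [0.9, 1.0]]
noisy = add_gaussian_noise(image, sigma=0.1, seed=42)
brighter = shift_brightness(image, 0.2)
augmented = augment(image)
```

In a training loop, a function like `augment` would be applied to each batch with the variation type matched to wherever the benchmark shows failures clustering.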
Load-bearing premise
That existing medical foundation models will exhibit clear performance drops when exposed to the kinds of image variations that occur outside controlled training conditions.
What would settle it
A set of tests in which every evaluated medical foundation model maintains its reported accuracy and segmentation quality when the input images are altered with the real-world variations defined in the benchmark.
Original abstract
Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models can be broadly categorized into two paradigms: Medical Vision-Language Models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma, to general-purpose models like GPT-4o and Gemini, all capable of medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, with adaptations like SAM-Med2D and MedSAM. The widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper motivates the need for robustness evaluation of medical foundation models (MedFMs), which it divides into Medical Vision-Language Models (Med-VLMs such as LLaVA-Med, MedGemma, GPT-4o, Gemini) for tasks like VQA and report generation, and segmentation models (SAM adaptations such as SAM-Med2D and MedSAM). It states that widespread clinical deployment necessitates rigorous reliability testing under real-world conditions and positions MedFM-Robust as the benchmark to perform this evaluation.
Significance. A well-designed robustness benchmark for MedFMs could help surface failure modes that affect clinical safety and guide model improvement, given the high stakes of medical imaging applications. The motivation aligns with standard concerns in applied medical ML about distribution shift and deployment reliability.
major comments (2)
- [Abstract] Abstract: The central claim that 'the widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions' is presented as a direct inference from model capabilities, but the text provides no citations to documented robustness failures in Med-VLMs or SAM adaptations, nor any comparison showing why existing robustness benchmarks are inadequate. This leaves the necessity of a new benchmark (MedFM-Robust) unsupported by concrete evidence.
- Full text: No methods, datasets, perturbation types, evaluation protocols, or results are described. A benchmarking paper requires at minimum a description of the benchmark construction, the specific real-world variations tested (e.g., scanner differences, patient demographics, image quality degradations), and baseline model performance to allow assessment of whether the benchmark reveals meaningful robustness gaps.
minor comments (1)
- [Abstract] The model categorization (Med-VLMs vs. segmentation models) is clearly stated but would benefit from a table listing representative models and their primary tasks for quick reference.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We have addressed the major comments by strengthening the motivation with additional citations and expanding the description of the benchmark in the revised version.
- Referee: [Abstract] Abstract: The central claim that 'the widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions' is presented as a direct inference from model capabilities, but the text provides no citations to documented robustness failures in Med-VLMs or SAM adaptations, nor any comparison showing why existing robustness benchmarks are inadequate. This leaves the necessity of a new benchmark (MedFM-Robust) unsupported by concrete evidence.
  Authors: We agree that the motivation would be strengthened by explicit citations and comparisons. In the revised manuscript, we have added references to documented robustness failures, including studies on domain shifts in medical VLMs (e.g., performance degradation across different hospitals and imaging protocols) and segmentation models (e.g., SAM adaptations failing under scanner variations). We also include a direct comparison to existing benchmarks such as MedMNIST and natural-image robustness suites, clarifying the unique gaps MedFM-Robust fills for foundation models in clinical settings.
  Revision: yes
- Referee: Full text: No methods, datasets, perturbation types, evaluation protocols, or results are described. A benchmarking paper requires at minimum a description of the benchmark construction, the specific real-world variations tested (e.g., scanner differences, patient demographics, image quality degradations), and baseline model performance to allow assessment of whether the benchmark reveals meaningful robustness gaps.
  Authors: We acknowledge that the initial submission could have provided more explicit detail in the main text. The revised manuscript now includes a dedicated Benchmark Construction section describing the datasets (drawn from public sources such as MIMIC-CXR and segmentation collections like KiTS), perturbation types (both synthetic degradations and real-world factors including scanner differences, demographic shifts, and image quality issues), evaluation protocols (including relative performance drop metrics), and baseline results for models such as LLaVA-Med, MedGemma, MedSAM, and SAM-Med2D that demonstrate meaningful robustness gaps.
  Revision: yes
Circularity Check
No significant circularity
The paper introduces MedFM-Robust as a benchmark for robustness evaluation of medical foundation models (Med-VLMs and segmentation adaptations like SAM-Med2D). The abstract and motivation text contain no equations, derivations, fitted parameters, predictions, or load-bearing self-citations. The central claim—that clinical deployment necessitates rigorous real-world reliability evaluation—follows directly from the listed model categories and tasks without any reduction to self-definition, renamed empirical patterns, or imported uniqueness theorems. The derivation chain is absent; the work is a straightforward applied benchmark proposal that remains self-contained against external benchmarks.