RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation
Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3
The pith
By fusing the MedSAM image encoder with the RobustSAM mask decoder and fine-tuning only the decoder, RobustMedSAM raises Dice on corrupted medical images from 0.613 (the SAM baseline) to 0.719.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that medical priors and corruption robustness reside in complementary SAM modules, so module-wise checkpoint fusion—MedSAM encoder plus RobustSAM decoder under a shared ViT-B architecture—followed by decoder-only fine-tuning on MedSegBench produces a model whose degraded-image Dice rises from 0.613 to 0.719 over SAM while preserving performance on clean data. An SVD-based parameter-efficient variant is also examined for limited encoder adaptation.
What carries the argument
Module-wise checkpoint fusion that places the MedSAM image encoder and RobustSAM mask decoder into a shared ViT-B architecture, with only the decoder fine-tuned on corrupted medical data to adapt robustness while the encoder remains frozen.
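A minimal sketch of what module-wise checkpoint fusion could look like, assuming each checkpoint is a flat mapping from parameter name to weights and that the key prefixes `image_encoder.` and `mask_decoder.` follow the public SAM layout. The function names `fuse_checkpoints` and `trainable_names`, the choice to take every non-encoder entry (including the prompt encoder) from the decoder-side checkpoint, and the toy string values standing in for tensors are illustrative assumptions, not details from the paper.

```python
from typing import Any, Dict, Mapping, Set

ENCODER_PREFIX = "image_encoder."  # assumed key prefix for the image encoder
DECODER_PREFIX = "mask_decoder."   # assumed key prefix for the mask decoder


def fuse_checkpoints(encoder_ckpt: Mapping[str, Any],
                     decoder_ckpt: Mapping[str, Any]) -> Dict[str, Any]:
    """Take image-encoder weights from one checkpoint and the rest from another."""
    fused: Dict[str, Any] = {}
    # Everything except the image encoder comes from the decoder-side checkpoint.
    for name, weight in decoder_ckpt.items():
        if not name.startswith(ENCODER_PREFIX):
            fused[name] = weight
    # The image encoder comes from the encoder-side checkpoint.
    for name, weight in encoder_ckpt.items():
        if name.startswith(ENCODER_PREFIX):
            fused[name] = weight
    return fused


def trainable_names(fused: Mapping[str, Any],
                    trainable_prefix: str = DECODER_PREFIX) -> Set[str]:
    """Names left trainable under decoder-only fine-tuning; all others stay frozen."""
    return {name for name in fused if name.startswith(trainable_prefix)}


# Toy checkpoints: strings stand in for weight tensors.
medsam = {
    "image_encoder.patch_embed.proj.weight": "medsam-encoder",
    "mask_decoder.iou_token.weight": "medsam-decoder",
}
robustsam = {
    "image_encoder.patch_embed.proj.weight": "robustsam-encoder",
    "mask_decoder.iou_token.weight": "robustsam-decoder",
    "prompt_encoder.no_mask_embed.weight": "robustsam-prompt",
}

fused = fuse_checkpoints(encoder_ckpt=medsam, decoder_ckpt=robustsam)
trainable = trainable_names(fused)
```

In a real framework the same idea would be applied to tensor state dicts, with gradients disabled for every parameter outside the trainable set.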
If this is right
- The fused model outperforms either source model alone on both in-distribution and out-of-distribution corrupted medical images across six modalities.
- Freezing the encoder after fusion preserves pretrained medical representations while the tuned decoder supplies robustness.
- An SVD-based parameter-efficient option allows limited additional encoder adaptation when full freezing is too restrictive.
- Performance gains are demonstrated on twelve corruption types drawn from MedSegBench benchmarks.
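To make the notion of a synthetic corruption with a severity level concrete, here is a small sketch of one corruption type (additive Gaussian noise) applied to an image stored as nested lists of intensities in [0, 1]. The severity-to-sigma table, the function name `add_gaussian_noise`, and the three severity levels are hypothetical; the source does not state the paper's actual corruption parameters.

```python
import random
from typing import Dict, List

Image = List[List[float]]

# Hypothetical mapping from severity level to noise standard deviation.
# These values are placeholders, not the paper's settings.
GAUSSIAN_NOISE_SIGMA: Dict[int, float] = {1: 0.02, 2: 0.05, 3: 0.10}


def add_gaussian_noise(image: Image, severity: int, seed: int = 0) -> Image:
    """Return a noisy copy of `image`, with intensities clipped to [0, 1]."""
    sigma = GAUSSIAN_NOISE_SIGMA[severity]  # raises KeyError for unknown levels
    rng = random.Random(seed)               # seeded for reproducible corruption
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


clean = [[0.0, 0.5], [1.0, 0.25]]
noisy = add_gaussian_noise(clean, severity=2, seed=1)
```

Fixing the seed and recording the severity table alongside the results is one way to make a corruption benchmark reproducible.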
Where Pith is reading between the lines
- The same modular separation of domain knowledge and robustness could be tested on other vision foundation models beyond SAM.
- Evaluating the method on naturally occurring clinical artifacts instead of synthetic corruptions would strengthen claims of practical utility.
- If medical priors and robustness prove partially entangled, joint fine-tuning of both modules might yield further gains at the cost of higher compute.
Load-bearing premise
The medical priors and corruption-robustness capabilities are strictly separable into the encoder and decoder respectively and can be recombined under a shared ViT-B architecture without interference or loss of either strength.
What would settle it
A controlled swap experiment in which the fused model scores lower than the stronger of the two source models on the same set of corrupted medical test images, or a decoder-only fine-tuning run that fails to improve robustness on held-out corruption types.
Original abstract
Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these capabilities are concentrated in complementary modules: the image encoder preserves medical priors, while the mask decoder governs corruption robustness. Motivated by this observation, we propose RobustMedSAM, which adopts module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture. We then fine-tune only the mask decoder on 35 medical datasets from MedSegBench, spanning six imaging modalities and 12 corruption types, while freezing the remaining components to preserve pretrained medical representations. We additionally investigate an SVD-based parameter-efficient variant for limited encoder adaptation. Experiments on both in-distribution and out-of-distribution benchmarks show that RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM, demonstrating that structured fusion of complementary pretrained models is an effective and practical approach for robust medical image segmentation.
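For readers unfamiliar with the headline metric, the sketch below computes the standard Dice coefficient, 2|P ∩ G| / (|P| + |G|), for binary masks stored as nested lists, and then reproduces the arithmetic behind the reported +0.106 gain. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made here; the toy masks are illustrative only.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]


def dice_coefficient(pred: Mask, gt: Mask) -> float:
    """Dice = 2 * |P ∩ G| / (|P| + |G|) for binary masks of equal shape."""
    intersection = 0
    pred_area = 0
    gt_area = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            pred_area += 1 if p else 0
            gt_area += 1 if g else 0
            intersection += 1 if (p and g) else 0
    if pred_area + gt_area == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (pred_area + gt_area)


# Toy masks: one overlapping pixel, |P| = 2, |G| = 1, so Dice = 2/3.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
score = dice_coefficient(pred, gt)

# Arithmetic behind the reported gain (numbers taken from the abstract).
sam_dice, fused_dice = 0.613, 0.719
absolute_gain = round(fused_dice - sam_dice, 3)  # 0.106
relative_gain = absolute_gain / sam_dice         # roughly 17% over the SAM baseline
```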
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RobustMedSAM, which performs module-wise checkpoint fusion by initializing the ViT-B image encoder from MedSAM and the mask decoder from RobustSAM. Only the decoder is fine-tuned on a mix of 35 medical datasets spanning six modalities and 12 corruption types from MedSegBench, while the encoder is frozen. The key result is an improvement in Dice score on degraded images from 0.613 (SAM) to 0.719 (+0.106), reported across in-distribution and out-of-distribution benchmarks, which the authors attribute to the complementary strengths of the two source models.
Significance. Should the central attribution hold after additional controls, the work provides a lightweight, practical recipe for adapting SAM-based models to medical imaging under realistic degradations by leveraging existing robust and domain-adapted checkpoints. It highlights the modularity of SAM's encoder-decoder design for targeted fine-tuning, which could generalize to other foundation-model adaptation tasks in computer vision.
major comments (1)
- [Experiments / Results] The central claim that the +0.106 Dice gain on degraded images stems from 'structured fusion of complementary pretrained models' (abstract) rests on an untested premise. The experimental evaluation reports only the comparison against vanilla SAM; no ablation is shown that (i) fine-tunes the original MedSAM decoder on the identical 35-dataset + 12-corruption regime while freezing the encoder, (ii) fine-tunes the RobustSAM decoder under the same protocol, or (iii) compares against full fine-tuning of MedSAM. Without these controls, the observed improvement cannot be attributed specifically to the encoder-decoder swap rather than to domain-and-corruption fine-tuning alone. This directly affects the interpretation of the fusion protocol as the source of robustness.
minor comments (2)
- [Abstract and Experiments] The abstract and results sections do not report the number of experimental runs, standard deviations, or any statistical significance tests for the reported Dice scores, limiting assessment of the reliability of the +0.106 improvement.
- [Experimental Setup] Exact details on corruption application (severity parameters, implementation specifics for the 12 types) and whether results are aggregated across modalities or reported per-modality are insufficiently specified in the experimental setup.
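The kind of reporting the first minor comment asks for can be sketched as follows: a mean and sample standard deviation over repeated runs, plus a paired bootstrap confidence interval on per-case Dice differences. The function names, the bootstrap settings, and every number in the example are placeholders chosen for illustration; none of them are results from the paper.

```python
import random
import statistics
from typing import List, Tuple


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric over repeated runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def paired_bootstrap_ci(baseline: List[float], candidate: List[float],
                        n_boot: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    if len(baseline) != len(candidate):
        raise ValueError("paired scores must have the same length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    boot_means = sorted(
        statistics.mean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Placeholder values only (not results from the paper).
run_mean, run_std = summarize_runs([0.70, 0.72, 0.74])
ci_low, ci_high = paired_bootstrap_ci(
    baseline=[0.60, 0.62, 0.58, 0.65],
    candidate=[0.70, 0.71, 0.69, 0.75],
)
```

An interval on the paired difference that excludes zero would support a claim that the improvement is not a seed or sampling artifact.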
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The major comment on the need for additional ablation studies to strengthen the attribution of performance gains to the module-wise fusion is addressed point-by-point below. We will incorporate the requested controls in the revised manuscript.
- Referee: [Experiments / Results] The central claim that the +0.106 Dice gain on degraded images stems from 'structured fusion of complementary pretrained models' (abstract) rests on an untested premise. The experimental evaluation reports only the comparison against vanilla SAM; no ablation is shown that (i) fine-tunes the original MedSAM decoder on the identical 35-dataset + 12-corruption regime while freezing the encoder, (ii) fine-tunes the RobustSAM decoder under the same protocol, or (iii) compares against full fine-tuning of MedSAM. Without these controls, the observed improvement cannot be attributed specifically to the encoder-decoder swap rather than to domain-and-corruption fine-tuning alone. This directly affects the interpretation of the fusion protocol as the source of robustness.
  Authors: We acknowledge that the current evaluation focuses on comparison to the vanilla SAM baseline and does not yet include the full set of controls needed to isolate the contribution of the encoder-decoder fusion. To address this, we will add the following ablations to the revised manuscript: (i) fine-tuning the original MedSAM decoder (with MedSAM encoder frozen) on the identical 35-dataset + 12-corruption training regime; (ii) fine-tuning the RobustSAM decoder under the same protocol; and (iii) full fine-tuning of MedSAM for direct comparison. These experiments will clarify whether the observed +0.106 Dice improvement on degraded images arises specifically from the complementary initialization (MedSAM encoder for medical priors + RobustSAM decoder for corruption robustness) rather than from domain-and-corruption fine-tuning in general. We believe the results will reinforce the motivation for the lightweight fusion approach while preserving the practical advantages of freezing the encoder.
  Revision: yes
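The controls discussed in this exchange can be written down as an explicit run matrix, which makes it easier to see what each comparison isolates. The sketch below is one possible encoding: the `AblationRun` dataclass, the run names, and in particular the reading of control (ii) as a full RobustSAM initialization with its encoder frozen are assumptions, since the exchange does not spell out those details.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AblationRun:
    name: str
    encoder_init: str    # checkpoint that initializes the image encoder
    decoder_init: str    # checkpoint that initializes the mask decoder
    train_encoder: bool  # False means the encoder is frozen
    train_decoder: bool


def build_ablation_grid() -> List[AblationRun]:
    """Baseline, proposed fusion, and the three controls requested by the referee."""
    return [
        AblationRun("sam_baseline", "SAM", "SAM", False, False),
        AblationRun("proposed_fusion", "MedSAM", "RobustSAM", False, True),
        # (i) MedSAM decoder fine-tuned with the MedSAM encoder frozen.
        AblationRun("control_i_medsam_decoder", "MedSAM", "MedSAM", False, True),
        # (ii) RobustSAM decoder under the same protocol (assumed: RobustSAM encoder, frozen).
        AblationRun("control_ii_robustsam_decoder", "RobustSAM", "RobustSAM", False, True),
        # (iii) Full fine-tuning of MedSAM.
        AblationRun("control_iii_medsam_full_ft", "MedSAM", "MedSAM", True, True),
    ]


grid = build_ablation_grid()
decoder_only_runs = [r.name for r in grid if r.train_decoder and not r.train_encoder]
```

All trained runs would share the same 35-dataset, 12-corruption training regime, so that differences between `proposed_fusion` and the controls reflect initialization rather than data.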
Circularity Check
No circularity: purely empirical method with no derivations or self-referential reductions
The paper describes a practical adaptation technique—module-wise checkpoint fusion of MedSAM encoder and RobustSAM decoder, followed by decoder-only fine-tuning on a 35-dataset corruption mix—and reports empirical Dice gains on in- and out-of-distribution benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on experimental comparison against vanilla SAM rather than any quantity that is definitionally equivalent to its inputs. The observation of complementary module capabilities is presented as motivation from prior inspection, not as a load-bearing theorem that collapses into the method itself. This is a standard empirical engineering paper whose validity hinges on ablation completeness and benchmark fairness, not on circular reasoning.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The image encoder preserves medical priors while the mask decoder governs corruption robustness.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture... fine-tune only the mask decoder on 35 medical datasets... clean–degraded cross-branch alignment
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: the image encoder preserves medical priors, while the mask decoder governs corruption robustness
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.