
arxiv: 2604.09814 · v1 · submitted 2026-04-10 · 💻 cs.CV

RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation

Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical image segmentation · SAM adaptation · corruption robustness · model fusion · foundation models · ViT-B · MedSegBench · image corruptions

The pith

By fusing the MedSAM image encoder with the RobustSAM mask decoder and fine-tuning only the decoder, RobustMedSAM raises Dice on corrupted medical images from 0.613 to 0.719.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical image segmentation models built on SAM perform well on clean data but drop sharply when images contain noise, blur, motion artifacts, or modality distortions common in clinical practice. The paper observes that medical-domain knowledge concentrates in the image encoder while corruption robustness concentrates in the mask decoder. RobustMedSAM therefore initializes a shared ViT-B model with the MedSAM encoder to keep medical priors and the RobustSAM decoder to keep robustness, then fine-tunes only the decoder on 35 datasets from MedSegBench that span six modalities and twelve corruption types while freezing everything else. This yields a 0.106 Dice gain on degraded images over baseline SAM on both in-distribution and out-of-distribution tests. The result matters because reliable segmentation under realistic artifacts is required for downstream diagnosis and treatment planning.

Core claim

The paper claims that medical priors and corruption robustness reside in complementary SAM modules. Module-wise checkpoint fusion (MedSAM encoder plus RobustSAM decoder under a shared ViT-B architecture), followed by decoder-only fine-tuning on MedSegBench, therefore produces a model whose degraded-image Dice rises from 0.613 for SAM to 0.719 while preserving performance on clean data. An SVD-based parameter-efficient variant is also examined for limited encoder adaptation.
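For readers unfamiliar with the metric behind these numbers: the Dice coefficient between a predicted mask P and a ground-truth mask G is 2|P ∩ G| / (|P| + |G|). The sketch below is a minimal illustration on nested lists of 0/1 values, not the paper's evaluation code. The function name `dice` and the empty-mask convention are assumptions made here; the last line only redoes the headline arithmetic.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_p = [int(v) for row in pred for v in row]
    flat_g = [int(v) for row in target for v in row]
    if len(flat_p) != len(flat_g):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & g for p, g in zip(flat_p, flat_g))
    total = sum(flat_p) + sum(flat_g)
    if total == 0:
        # Convention assumed here: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy masks (illustrative only): 2 overlapping pixels, 3 foreground pixels each.
pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
score = dice(pred, gt)  # 2 * 2 / (3 + 3)

# Headline gain reported in the abstract: 0.719 - 0.613.
gain = round(0.719 - 0.613, 3)
```

A gain of 0.106 on this scale is an absolute difference in mean Dice, not a relative improvement.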

What carries the argument

Module-wise checkpoint fusion that places the MedSAM image encoder and RobustSAM mask decoder into a shared ViT-B architecture, with only the decoder fine-tuned on corrupted medical data to adapt robustness while the encoder remains frozen.
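A minimal sketch of what module-wise checkpoint fusion could look like at the state-dict level. It assumes checkpoints are flat key-to-tensor mappings whose keys begin with `image_encoder.`, `prompt_encoder.`, or `mask_decoder.`. Taking the prompt encoder from MedSAM follows the Figure 1 caption as read here. Plain strings stand in for tensors so the example runs without a deep-learning library. The helper names are hypothetical, and RobustSAM-specific modules that live outside the decoder prefix would need extra handling.

```python
ENCODER_PREFIX = "image_encoder."
PROMPT_PREFIX = "prompt_encoder."
DECODER_PREFIX = "mask_decoder."


def fuse_checkpoints(medsam_state, robustsam_state):
    """Take encoder and prompt-encoder weights from MedSAM, decoder weights from RobustSAM."""
    fused = {}
    for key, value in medsam_state.items():
        if key.startswith((ENCODER_PREFIX, PROMPT_PREFIX)):
            fused[key] = value
    for key, value in robustsam_state.items():
        if key.startswith(DECODER_PREFIX):
            fused[key] = value
    return fused


def trainable_keys(state):
    """Decoder-only fine-tuning: every key outside the decoder stays frozen."""
    return sorted(k for k in state if k.startswith(DECODER_PREFIX))


# Toy checkpoints; the string values are placeholders for weight tensors.
medsam = {
    "image_encoder.w": "med_enc",
    "prompt_encoder.w": "med_prompt",
    "mask_decoder.w": "med_dec",
}
robustsam = {
    "image_encoder.w": "rob_enc",
    "prompt_encoder.w": "rob_prompt",
    "mask_decoder.w": "rob_dec",
    "mask_decoder.extra": "rob_extra",
}
fused = fuse_checkpoints(medsam, robustsam)
to_train = trainable_keys(fused)
```

In a real framework the fused mapping would be loaded into a single ViT-B model and gradients disabled for every parameter not listed in `to_train`.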

If this is right

  • The fused model outperforms either source model alone on both in-distribution and out-of-distribution corrupted medical images across six modalities.
  • Freezing the encoder after fusion preserves pretrained medical representations while the tuned decoder supplies robustness.
  • An SVD-based parameter-efficient option allows limited additional encoder adaptation when full freezing is too restrictive.
  • Performance gains are demonstrated on twelve corruption types drawn from MedSegBench benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modular separation of domain knowledge and robustness could be tested on other vision foundation models beyond SAM.
  • Evaluating the method on naturally occurring clinical artifacts instead of synthetic corruptions would strengthen claims of practical utility.
  • If medical priors and robustness prove partially entangled, joint fine-tuning of both modules might yield further gains at the cost of higher compute.

Load-bearing premise

The medical priors and corruption-robustness capabilities are strictly separable into the encoder and decoder respectively and can be recombined under a shared ViT-B architecture without interference or loss of either strength.

What would settle it

A controlled swap experiment in which the fused model scores lower than the stronger of the two source models on the same set of corrupted medical test images, or a decoder-only fine-tuning run that fails to improve robustness on held-out corruption types.
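The two falsification conditions can be written down as simple checks. This is a sketch under assumptions: the function names, the optional `margin`, and the dictionary layout are hypothetical, and the numbers in the usage lines are placeholders, not results from the paper.

```python
def fusion_beats_sources(fused_dice, medsam_dice, robustsam_dice, margin=0.0):
    """True if the fused model exceeds the stronger source model on the same test set."""
    return fused_dice > max(medsam_dice, robustsam_dice) + margin


def improves_on_held_out(before, after):
    """before/after map held-out corruption names to mean Dice before and after tuning."""
    return all(after[name] > before[name] for name in before)


# Placeholder values for illustration only.
swap_ok = fusion_beats_sources(0.70, medsam_dice=0.65, robustsam_dice=0.60)
held_out_ok = improves_on_held_out(
    {"held_out_a": 0.55, "held_out_b": 0.60},
    {"held_out_a": 0.61, "held_out_b": 0.58},
)
```

Under this reading, the claim would be in trouble if either check came back false on real measurements, as `held_out_ok` does here because one held-out corruption type gets worse.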

Figures

Figures reproduced from arXiv: 2604.09814 by Benoit L. Marteau, J. Ben Tamo, Jieru Li, Matthew Chen, May D. Wang, Micky C. Nnamdi.

Figure 1. Overview of RobustMedSAM. During training, each medical image is paired with a degraded counterpart generated by medical degradation augmentation. Both the clean and degraded images are processed using the same frozen MedSAM image encoder and prompt encoder. The resulting features are decoded by a shared robust decoder initialized from RobustSAM. Finetuned modules are highlighted in color, while frozen mod…
Figure 2. Tail robustness under degradations. Empirical CDF of degradation-level Dice across datasets. RobustMedSAM shows a consistent rightward shift under point prompts, indicating improved worst-case performance, while remaining comparable to MedSAM under box prompts. … improves Dice by +0.518. RobustMedSAM maintains strong performance across modalities, with the largest gains where RobustSAM lacks medical priors (e…
Figure 3. Point prompts: clean vs. degraded comparison (overall Dice ↑, point prompts). RobustMedSAM improves robustness on degraded inputs while maintaining clean performance comparable to SAM; +SVD exhibits a robustness–clean trade-off. … consistently outperforms both SAM and MedSAM across all degradations, with the largest gains observed for modality-specific corruptions commonly encountered in medical imaging. In pa…
Figure 5 summarizes cross-modality generalization. Under point prompts, RobustMedSAM achieves the strongest performance on dermoscopy, MRI, and ultrasound, with the largest gains over SAM and MedSAM observed on MRI and ultrasound. On microscopy, its performance remains limited, although still improving over MedSAM. Under box prompts, RobustMedSAM remains competitive on MRI and substantially improves microscopy, …
Figure 6 shows performance as a function of the number of point prompts K. RobustMedSAM improves rapidly as K increases and largely saturates after K = 3, whereas SAM exhibits smaller but steady gains and MedSAM remains substantially weaker under point prompting. For reference, we also include box-prompted SAM, MedSAM, and RobustMedSAM. MedSAM remains highly sensitive to prompt type, performing poorly with points …
Figure 7. Qualitative segmentation results under degradations. Each panel shows (left to right): degraded input with …
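
The "rightward shift" in the Figure 2 caption has a concrete meaning: at any Dice threshold, a smaller fraction of the better model's scores falls at or below it. The sketch below computes an empirical CDF and checks that condition at a few thresholds. The function names are hypothetical and the scores are placeholders, not values read from the figure.

```python
def ecdf(scores, threshold):
    """Empirical CDF: fraction of scores at or below the threshold."""
    return sum(s <= threshold for s in scores) / len(scores)


def right_shifted(candidate, baseline, thresholds):
    """True if the candidate's ECDF never lies above the baseline's at the given thresholds."""
    return all(ecdf(candidate, t) <= ecdf(baseline, t) for t in thresholds)


# Placeholder degradation-level Dice scores (illustrative only).
baseline_scores = [0.2, 0.4, 0.6, 0.8]
candidate_scores = [0.5, 0.6, 0.7, 0.9]
thresholds = [0.3, 0.5, 0.7]

shifted = right_shifted(candidate_scores, baseline_scores, thresholds)
low_tail_baseline = ecdf(baseline_scores, 0.3)
low_tail_candidate = ecdf(candidate_scores, 0.3)
```

The low-threshold values are the ones that matter for worst-case behavior: a lower ECDF at 0.3 means fewer badly failed segmentations.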

Original abstract

Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these capabilities are concentrated in complementary modules: the image encoder preserves medical priors, while the mask decoder governs corruption robustness. Motivated by this observation, we propose RobustMedSAM, which adopts module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture. We then fine-tune only the mask decoder on 35 medical datasets from MedSegBench, spanning six imaging modalities and 12 corruption types, while freezing the remaining components to preserve pretrained medical representations. We additionally investigate an SVD-based parameter-efficient variant for limited encoder adaptation. Experiments on both in-distribution and out-of-distribution benchmarks show that RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM, demonstrating that structured fusion of complementary pretrained models is an effective and practical approach for robust medical image segmentation.
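
The Figure 1 caption describes pairing each clean training image with a degraded counterpart. As a rough illustration of that pairing, the sketch below applies one corruption type (additive Gaussian noise) at a chosen severity to an image stored as nested lists of intensities in [0, 1]. The severity table, function names, and seed handling are assumptions made here; the paper's 12 corruption types and their parameters are not specified in this text.

```python
import random

# Hypothetical mapping from severity level to noise standard deviation.
SEVERITY_SIGMA = {1: 0.02, 2: 0.05, 3: 0.10}


def add_gaussian_noise(image, severity, rng):
    """Add zero-mean Gaussian noise and clip each pixel back to [0, 1]."""
    sigma = SEVERITY_SIGMA[severity]
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def make_pair(image, severity, seed=0):
    """Return (clean, degraded); both would share the same ground-truth mask."""
    rng = random.Random(seed)
    return image, add_gaussian_noise(image, severity, rng)


image = [[0.0, 0.5], [1.0, 0.25]]
clean, degraded = make_pair(image, severity=2, seed=42)
```

Seeding the generator per call keeps the degraded counterpart reproducible, which matters when clean and degraded results are compared image by image.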

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes RobustMedSAM, which performs module-wise checkpoint fusion by initializing the ViT-B image encoder from MedSAM and the mask decoder from RobustSAM. Only the decoder is fine-tuned on a mix of 35 medical datasets spanning six modalities and 12 corruption types from MedSegBench, while the encoder is frozen. The key result is an improvement in Dice score on degraded images from 0.613 (SAM) to 0.719 (+0.106) on both in- and out-of-distribution benchmarks, attributing the gain to the complementary strengths of the two source models.

Significance. Should the central attribution hold after additional controls, the work provides a lightweight, practical recipe for adapting SAM-based models to medical imaging under realistic degradations by leveraging existing robust and domain-adapted checkpoints. It highlights the modularity of SAM's encoder-decoder design for targeted fine-tuning, which could generalize to other foundation-model adaptation tasks in computer vision.

major comments (1)
  1. [Experiments / Results] The central claim that the +0.106 Dice gain on degraded images stems from 'structured fusion of complementary pretrained models' (abstract) rests on an untested premise. The experimental evaluation reports only the comparison against vanilla SAM; no ablation is shown that (i) fine-tunes the original MedSAM decoder on the identical 35-dataset + 12-corruption regime while freezing the encoder, (ii) fine-tunes the RobustSAM decoder under the same protocol, or (iii) compares against full fine-tuning of MedSAM. Without these controls, the observed improvement cannot be attributed specifically to the encoder-decoder swap rather than to domain-and-corruption fine-tuning alone. This directly affects the interpretation of the fusion protocol as the source of robustness.
minor comments (2)
  1. [Abstract and Experiments] The abstract and results sections do not report the number of experimental runs, standard deviations, or any statistical significance tests for the reported Dice scores, limiting assessment of the reliability of the +0.106 improvement.
  2. [Experimental Setup] Exact details on corruption application (severity parameters, implementation specifics for the 12 types) and whether results are aggregated across modalities or reported per-modality are insufficiently specified in the experimental setup.
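
The first minor comment asks for run-to-run variability and significance. One way this could be reported, sketched with the standard library only: mean and standard deviation across repeated runs, plus a paired bootstrap confidence interval on the per-image Dice difference between two models. The function names, resample count, and all scores below are placeholders chosen for illustration, not values from the paper.

```python
import random
import statistics


def mean_std(values):
    """Mean and sample standard deviation across repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


def paired_bootstrap_ci(model_scores, baseline_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-image Dice difference (model minus baseline)."""
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder numbers (illustrative only).
run_mean, run_std = mean_std([0.71, 0.72, 0.73])
model = [0.70, 0.75, 0.68, 0.80, 0.72]
baseline = [0.60, 0.66, 0.61, 0.70, 0.65]
ci_low, ci_high = paired_bootstrap_ci(model, baseline)
```

An interval that excludes zero would support the claim that the improvement is not an artifact of which test images happened to be sampled.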

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. The major comment on the need for additional ablation studies to strengthen the attribution of performance gains to the module-wise fusion is addressed point-by-point below. We will incorporate the requested controls in the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments / Results] The central claim that the +0.106 Dice gain on degraded images stems from 'structured fusion of complementary pretrained models' (abstract) rests on an untested premise. The experimental evaluation reports only the comparison against vanilla SAM; no ablation is shown that (i) fine-tunes the original MedSAM decoder on the identical 35-dataset + 12-corruption regime while freezing the encoder, (ii) fine-tunes the RobustSAM decoder under the same protocol, or (iii) compares against full fine-tuning of MedSAM. Without these controls, the observed improvement cannot be attributed specifically to the encoder-decoder swap rather than to domain-and-corruption fine-tuning alone. This directly affects the interpretation of the fusion protocol as the source of robustness.

    Authors: We acknowledge that the current evaluation focuses on comparison to the vanilla SAM baseline and does not yet include the full set of controls needed to isolate the contribution of the encoder-decoder fusion. To address this, we will add the following ablations to the revised manuscript: (i) fine-tuning the original MedSAM decoder (with MedSAM encoder frozen) on the identical 35-dataset + 12-corruption training regime; (ii) fine-tuning the RobustSAM decoder under the same protocol; and (iii) full fine-tuning of MedSAM for direct comparison. These experiments will clarify whether the observed +0.106 Dice improvement on degraded images arises specifically from the complementary initialization (MedSAM encoder for medical priors + RobustSAM decoder for corruption robustness) rather than from domain-and-corruption fine-tuning in general. We believe the results will reinforce the motivation for the lightweight fusion approach while preserving the practical advantages of freezing the encoder.

    revision: yes
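
The promised controls can be laid out as a small configuration grid, which makes it easier to see which pair of runs isolates the fusion itself. This is a sketch under assumptions: the class and field names are hypothetical, control (ii) is read here as RobustSAM with its own encoder, and "full fine-tuning" is taken to mean all three modules are trainable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    name: str
    encoder_init: str
    decoder_init: str
    trainable: tuple


DECODER_ONLY = ("mask_decoder",)
ALL_MODULES = ("image_encoder", "prompt_encoder", "mask_decoder")

ABLATIONS = [
    AblationConfig("fusion (proposed)", "MedSAM", "RobustSAM", DECODER_ONLY),
    AblationConfig("(i) MedSAM decoder tuned", "MedSAM", "MedSAM", DECODER_ONLY),
    # Assumption: control (ii) keeps RobustSAM's own encoder.
    AblationConfig("(ii) RobustSAM decoder tuned", "RobustSAM", "RobustSAM", DECODER_ONLY),
    # Assumption: full fine-tuning updates every module.
    AblationConfig("(iii) MedSAM full fine-tune", "MedSAM", "MedSAM", ALL_MODULES),
]


def differs_only_in_decoder_init(a, b):
    """True when two runs share encoder and trainable set but start from different decoders."""
    return (
        a.encoder_init == b.encoder_init
        and a.trainable == b.trainable
        and a.decoder_init != b.decoder_init
    )


proposed = ABLATIONS[0]
isolating = [c.name for c in ABLATIONS[1:] if differs_only_in_decoder_init(proposed, c)]
```

Under these assumptions only control (i) differs from the proposed run in decoder initialization alone, so that comparison is the one that speaks most directly to the referee's objection.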

Circularity Check

0 steps flagged

No circularity: purely empirical method with no derivations or self-referential reductions


The paper describes a practical adaptation technique—module-wise checkpoint fusion of MedSAM encoder and RobustSAM decoder, followed by decoder-only fine-tuning on a 35-dataset corruption mix—and reports empirical Dice gains on in- and out-of-distribution benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on experimental comparison against vanilla SAM rather than any quantity that is definitionally equivalent to its inputs. The observation of complementary module capabilities is presented as motivation from prior inspection, not as a load-bearing theorem that collapses into the method itself. This is a standard empirical engineering paper whose validity hinges on ablation completeness and benchmark fairness, not on circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that SAM modules separate medical priors from corruption robustness, plus the availability of compatible pretrained checkpoints; no new entities are postulated and no free parameters are introduced beyond standard fine-tuning hyperparameters.

axioms (1)
  • domain assumption: The image encoder preserves medical priors, while the mask decoder governs corruption robustness.
    This module-wise separation is stated as the motivating observation for the fusion strategy.

pith-pipeline@v0.9.0 · 5535 in / 1336 out tokens · 50331 ms · 2026-05-10T16:39:52.810985+00:00 · methodology


