Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters
Pith reviewed 2026-05-09 15:29 UTC · model grok-4.3
The pith
Dual-adapter fine-tuning adapts SAM2 for prompt-free biomedical segmentation with major efficiency gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose a prompt-free, parameter-efficient fine-tuning framework for multi-class semantic segmentation on variable-sized biomedical inputs. It combines a convolutional Positional Encoding Generator, which handles arbitrary aspect ratios, with a dual-adapter strategy: a High-Performance Adapter that uses deformable convolutions for precise boundary modeling, and a Lightweight Adapter that uses structural re-parameterization to minimize inference latency. On the ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets, the method is reported to improve accuracy over vanilla SAM2 and to cut computational cost relative to prior heavyweight medical SAM adaptations.
What carries the argument
Dual-adapter strategy with High-Performance Adapter using deformable convolutions for boundary modeling and Lightweight Adapter using structural re-parameterization for low latency, together with convolutional Positional Encoding Generator to handle variable aspect ratios.
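To make the re-parameterization idea concrete, here is a minimal single-channel sketch in plain Python. It assumes a RepVGG-style layout (a 3×3 branch, a 1×1 branch, and an identity branch, with batch normalization left out); the branch layout of the paper's Lightweight Adapter is not given on this page, so every name and weight below is illustrative only.

```python
# Sketch only: single-channel, RepVGG-style branch fusion. The branch layout
# is an assumption, not necessarily the paper's Lightweight Adapter.

def conv2d_same(image, kernel):
    """3x3 cross-correlation with zero padding (output size equals input size)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[ky][kx] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_branch(image, k3, k1_scalar):
    """Training-time form: 3x3 branch + 1x1 branch + identity branch, summed."""
    a = conv2d_same(image, k3)
    return [[a[y][x] + k1_scalar * image[y][x] + image[y][x]
             for x in range(len(image[0]))] for y in range(len(image))]


def fuse_branches(k3, k1_scalar):
    """Inference-time form: fold all three branches into one 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1_scalar  # a 1x1 kernel is a 3x3 kernel that is zero off-centre
    fused[1][1] += 1.0        # identity is a 1x1 kernel with weight 1
    return fused


# Hypothetical input and weights, chosen only for the demonstration.
image = [[1.0, 2.0, 0.5, -1.0], [0.0, 3.0, 1.5, 2.0], [4.0, -2.0, 0.0, 1.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k1 = 0.7

train_out = multi_branch(image, k3, k1)
infer_out = conv2d_same(image, fuse_branches(k3, k1))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(train_out, infer_out)
               for a, b in zip(row_a, row_b))
print(max_diff < 1e-9)
```

Because convolution is linear, the three branches and the fused kernel give the same output up to floating-point rounding, which is why the extra branches add no cost at inference time.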
If this is right
- The method supports multi-class segmentation on biomedical images of arbitrary sizes without manual prompts.
- It delivers up to 19.66 percent higher accuracy than vanilla SAM2 across the tested datasets.
- Computational costs drop by approximately 87 percent compared with heavyweight medical SAM adaptations.
- The re-parameterized lightweight adapter enables lower-latency inference suitable for clinical use.
Where Pith is reading between the lines
- The same dual-adapter pattern could transfer to adapting other vision foundation models to specialized domains such as satellite or industrial imagery.
- Reduced compute requirements may allow SAM2-based segmentation on standard hospital workstations without dedicated high-end GPUs.
- Combining the adapters with further efficiency techniques like quantization could yield additional speed improvements for real-time applications.
Load-bearing premise
The dual-adapter design with deformable convolutions and re-parameterization will maintain its accuracy-efficiency trade-off when applied to new biomedical datasets or tasks beyond the four evaluated.
What would settle it
Testing the adapted model on an additional biomedical dataset such as brain tumor segmentation and checking whether the reported accuracy gain over vanilla SAM2 and the 87 percent computational reduction are preserved.
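One way such a check could be scored is with per-class intersection-over-union on predicted and ground-truth label maps. This page does not say which accuracy metric the paper reports, so mean IoU is an assumption here, and the tiny label maps are made up for illustration.

```python
# Sketch only: mean IoU over classes that appear in prediction or target.
# The metric choice and the toy label maps are assumptions.

def per_class_iou(pred, target, num_classes):
    """Return {class_id: IoU} for every class with a non-empty union."""
    ious = {}
    for c in range(num_classes):
        inter = 0
        union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                in_p, in_t = (p == c), (t == c)
                inter += int(in_p and in_t)
                union += int(in_p or in_t)
        if union:  # skip classes absent from both maps
            ious[c] = inter / union
    return ious


def mean_iou(pred, target, num_classes):
    ious = per_class_iou(pred, target, num_classes)
    if not ious:
        raise ValueError("no class present in prediction or target")
    return sum(ious.values()) / len(ious)


pred = [[0, 1], [1, 2]]    # hypothetical predicted labels
target = [[0, 1], [2, 2]]  # hypothetical ground-truth labels
print(per_class_iou(pred, target, 3))
print(mean_iou(pred, target, 3))
```

Running the same function on the adapted model and on vanilla SAM2 for the new dataset would give the two numbers whose gap is in question.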
Original abstract
Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, parameter-efficient fine-tuning framework designed for multi-class segmentation on variable-sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual-adapter strategy: High-Performance Adapter utilizing deformable convolutions for precise boundary modeling and Lightweight Adapter employing structural re-parameterization to minimize inference latency. Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets demonstrate that our approach significantly outperforms strong adaptation baselines. Specifically, our method improved segmentation accuracy by up to 19.66% over the vanilla SAM2, while reducing computational costs by approximately 87% compared to heavyweight medical SAM adaptations, establishing a superior trade-off between accuracy and efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a prompt-free, parameter-efficient fine-tuning framework for adapting SAM2 to multi-class biomedical semantic segmentation on variable-sized inputs. It introduces a convolutional Positional Encoding Generator and a dual-adapter strategy consisting of a High-Performance Adapter (deformable convolutions for boundary modeling) and a Lightweight Adapter (structural re-parameterization for reduced inference latency). Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC report up to 19.66% accuracy gains over vanilla SAM2 and ~87% compute reduction versus prior heavyweight medical SAM adaptations.
Significance. If the empirical claims hold with fuller validation, the work provides a practical accuracy-efficiency trade-off for adapting large foundation models like SAM2 to biomedical domains without prompts or heavy compute, leveraging public benchmarks for reproducibility. The dual-adapter design and re-parameterization approach could inform efficient adaptation strategies in medical imaging if generalization is demonstrated.
major comments (3)
- [§4 and Abstract] §4 (Experiments) and Abstract: The headline quantitative claims (19.66% accuracy lift and 87% compute reduction) are presented without statistical significance tests, standard deviations across multiple runs, or explicit hyperparameter search details; this undermines confidence in the superiority over baselines given potential selection effects or data leakage risks on the four public datasets.
- [§3.2 and §4.3] §3.2 (Dual-Adapter Strategy) and §4.3 (Efficiency Analysis): The claim that the High-Performance Adapter (deformable convolutions) plus Lightweight Adapter simultaneously delivers both the accuracy gain and the inference speedup is load-bearing, yet no ablation isolates the contribution of each adapter component, and no results are shown for an unseen modality (e.g., ultrasound) to test whether the deformable-convolution overfitting risk materializes.
- [Table 2 / §4] Table 2 or equivalent comparison table in §4: The 87% computational-cost reduction is asserted relative to 'heavyweight medical SAM adaptations,' but the table does not report the exact parameter counts, FLOPs, or inference latencies of those specific baselines, preventing independent verification of the efficiency claim.
minor comments (2)
- [§3.1] The description of the convolutional Positional Encoding Generator in §3.1 would benefit from an explicit equation or pseudocode showing how it handles arbitrary aspect ratios.
- [Figure 3] Figure 3 (architecture diagram) could include a side-by-side latency/accuracy Pareto plot to visually support the claimed trade-off.
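As an illustration of what the pseudocode requested for §3.1 might look like, the sketch below follows the conditional positional encoding idea (a zero-padded convolution over the token grid, added back to the tokens). It is reduced to one channel and plain Python; the kernel values and grid sizes are hypothetical, and the paper's actual generator may differ.

```python
# Sketch only: a one-channel positional encoding generator in the spirit of
# conditional positional encodings. Not the paper's implementation.

def peg(tokens, height, width, kernel):
    """Add a conv-generated positional term to a flat list of H*W scalar tokens."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    grid = [tokens[r * width:(r + 1) * width] for r in range(height)]
    out = []
    for y in range(height):
        for x in range(width):
            pos = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < height and 0 <= xx < width:  # zero padding
                        pos += kernel[ky][kx] * grid[yy][xx]
            out.append(grid[y][x] + pos)  # residual: token + positional term
    return out


kernel = [[1.0] * 3 for _ in range(3)]  # hypothetical weights; learned in practice
wide = peg([1.0] * 10, 2, 5, kernel)    # a 2x5 token grid
tall = peg([1.0] * 12, 4, 3, kernel)    # a 4x3 token grid, same kernel
print(wide[0], wide[1], tall[4])
```

The same kernel runs on both grid shapes because nothing in it depends on a fixed token count, and the zero padding makes border tokens differ from interior ones even when all inputs are identical, which is where the positional signal comes from.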
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our manuscript. We address each major comment point by point below, committing to revisions that strengthen the empirical rigor and transparency of the work where feasible.
-
Referee: [§4 and Abstract] §4 (Experiments) and Abstract: The headline quantitative claims (19.66% accuracy lift and 87% compute reduction) are presented without statistical significance tests, standard deviations across multiple runs, or explicit hyperparameter search details; this undermines confidence in the superiority over baselines given potential selection effects or data leakage risks on the four public datasets.
Authors: We appreciate this observation on the need for greater statistical robustness. In the revised manuscript, we will rerun the key experiments across multiple random seeds to report mean performance with standard deviations. We will also expand the experimental details to include the hyperparameter search procedure and add statistical significance tests (such as paired t-tests against baselines) for the reported accuracy improvements. These additions will mitigate concerns about variability and selection effects. revision: yes
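The promised reporting could be sketched as follows, assuming one score per random seed for each method and the same seeds for both. The per-seed numbers are invented for the example and are not results from the paper; only the t statistic and degrees of freedom are computed, since the standard library has no t-distribution.

```python
# Sketch only: mean +/- std over seeds and a paired t statistic.
# All scores below are HYPOTHETICAL, not taken from the paper.
import math
import statistics


def mean_std(scores):
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(ours, baseline):
    """Return (t, degrees_of_freedom) for paired per-seed scores."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


ours = [0.81, 0.83, 0.82, 0.84, 0.80]      # hypothetical per-seed scores
baseline = [0.78, 0.79, 0.80, 0.80, 0.78]  # hypothetical per-seed scores

m, s = mean_std(ours)
t, dof = paired_t_statistic(ours, baseline)
print(f"ours: {m:.3f} +/- {s:.3f}, t = {t:.2f}, dof = {dof}")
```

The t statistic would then be compared against a t-distribution with the stated degrees of freedom to obtain a p-value.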
-
Referee: [§3.2 and §4.3] §3.2 (Dual-Adapter Strategy) and §4.3 (Efficiency Analysis): The claim that the High-Performance Adapter (deformable convolutions) plus Lightweight Adapter simultaneously delivers both the accuracy gain and the inference speedup is load-bearing, yet no ablation isolates the contribution of each adapter component, and no results are shown for an unseen modality (e.g., ultrasound) to test whether the deformable-convolution overfitting risk materializes.
Authors: We agree that component-wise ablations are necessary to substantiate the dual-adapter design. The revised version will include new ablation tables isolating the High-Performance Adapter (deformable convolutions), the Lightweight Adapter (re-parameterization), and their joint use, quantifying effects on both accuracy and inference latency. Our current benchmarks already span four distinct modalities (electron microscopy, endoscopy, CT, and MRI), which provides evidence against severe overfitting to a single domain. However, we lack ultrasound data and cannot perform new experiments on it without additional data acquisition outside the scope of this work; we will explicitly discuss this as a limitation and the associated overfitting risk in the revision. revision: partial
-
Referee: [Table 2 / §4] Table 2 or equivalent comparison table in §4: The 87% computational-cost reduction is asserted relative to 'heavyweight medical SAM adaptations,' but the table does not report the exact parameter counts, FLOPs, or inference latencies of those specific baselines, preventing independent verification of the efficiency claim.
Authors: We acknowledge the need for full transparency in the efficiency comparison. In the revised manuscript, we will update the relevant table to report the precise parameter counts, FLOPs, and measured inference latencies for each heavyweight medical SAM adaptation baseline (drawn from their original papers or consistent re-implementations). This will enable direct verification of the reported ~87% compute reduction. revision: yes
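Once the table carries per-baseline figures, a reader could check the headline reduction with a calculation like the one below. The method names and numbers are placeholders chosen so that the arithmetic is easy to follow; they are not the paper's measurements.

```python
# Sketch only: relative cost reduction against each baseline.
# Names and values are HYPOTHETICAL placeholders.

def relative_reduction(ours, baseline):
    """Fraction by which `ours` is smaller than `baseline` (0.87 means 87% less)."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return 1.0 - ours / baseline


ours = {"params_m": 5.0, "gflops": 13.0}
baselines = {
    "heavy_adaptation_a": {"params_m": 40.0, "gflops": 100.0},
    "heavy_adaptation_b": {"params_m": 25.0, "gflops": 65.0},
}

for name, cost in baselines.items():
    for key in ("params_m", "gflops"):
        pct = 100.0 * relative_reduction(ours[key], cost[key])
        print(f"{name} {key}: {pct:.1f}% lower")
```

Reporting the reduction separately for parameters, FLOPs, and latency matters because the three can move by very different amounts.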
- Not addressed in the revision: results on an additional unseen modality such as ultrasound, since suitable datasets were not part of the current study and new data collection would be required.
Circularity Check
No circularity in empirical adaptation claims
The paper is an empirical adaptation study that introduces a dual-adapter fine-tuning framework for SAM2 and validates performance via experiments on four public biomedical datasets. No mathematical derivation chain, equations, or predictions are presented that reduce claimed improvements to quantities defined solely by parameters fitted within the paper itself. The accuracy and efficiency results are reported as experimental outcomes rather than self-referential constructs, and no load-bearing self-citations or uniqueness theorems are invoked to justify the core method.
Axiom & Free-Parameter Ledger
free parameters (3)
- Deformable convolution offset and modulation parameters
- Re-parameterization scaling factors
- Adapter hidden dimensions and ranks
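To show where the first ledger entry enters the computation, here is a single-channel sketch of a modulated deformable 3×3 convolution evaluated at one output location, in the style of Deformable ConvNets v2. In a real layer the offsets and modulation scalars are predicted by another convolution; here they are passed in as hypothetical fixed values.

```python
# Sketch only: one output value of a modulated deformable 3x3 convolution.
# Offsets and masks are hand-set here; in practice they are learned.
import math


def bilinear(image, y, x):
    """Bilinearly sample `image` at fractional (y, x); outside the image is zero."""
    h, w = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    fy, fx = y - y0, x - x0
    val = 0.0
    for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * image[yy][xx]
    return val


def deform_conv_at(image, y, x, kernel, offsets, masks):
    """Each kernel tap samples at its regular position plus a (dy, dx) offset,
    and its contribution is scaled by a modulation scalar."""
    acc = 0.0
    for ky in range(3):
        for kx in range(3):
            dy, dx = offsets[ky][kx]
            sample = bilinear(image, y + ky - 1 + dy, x + kx - 1 + dx)
            acc += kernel[ky][kx] * masks[ky][kx] * sample
    return acc


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[1.0] * 3 for _ in range(3)]
no_offset = [[(0.0, 0.0)] * 3 for _ in range(3)]
shift_right = [[(0.0, 0.5)] * 3 for _ in range(3)]  # hypothetical offsets
ones = [[1.0] * 3 for _ in range(3)]

regular = deform_conv_at(image, 1, 1, kernel, no_offset, ones)
shifted = deform_conv_at(image, 1, 1, kernel, shift_right, ones)
print(regular, shifted)
```

With zero offsets and unit modulation the result reduces to an ordinary 3×3 convolution, so the offsets and modulation scalars are exactly the extra free parameters the ledger refers to.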
axioms (1)
- domain assumption: SAM2 features remain useful after domain shift when augmented by small adapter modules
Reference graph
Works this paper leans on
-
[1]
Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters
INTRODUCTION Biomedical semantic segmentation is essential for computer-aided diagnosis and quantitative analysis. Although specialized models for semantic segmentation, such as U-Net [4], perform well, they are typically trained from scratch per dataset. Recently, foundation models such as SAM [2] and SAM2 [5] have shown strong generalization via larg...
-
[2]
RELATED WORKS 2.1. Parameter Efficient Fine-Tuning (PEFT) Parameter Efficient Fine-Tuning (PEFT) adapts large pre-trained models by updating only a small subset of parameters while keeping most weights frozen, reducing training cost and storage. Representative approaches include low-rank adaptation, such as LoRA [14] and adapter-based tuning that inserts...
-
[3]
METHODOLOGY Fig. 2 illustrates the overview of our method. During fine-tuning, we keep the original SAM2 components (Image Encoder, Prompt Encoder, and Mask Decoder) frozen, and update only the newly introduced modules marked as Learnable. 3.1. Prompt-Free Multi-Class Segmentation With SAM2 When still images are fed into SAM2, SAM2 follows the SAM-styl...
-
[4]
to mathematically fuse these branches into a single equivalent 3×3 convolution. Although recent methods such as RepAdapter [20] utilize re-parameterization to merge adapters into the backbone weights for zero-cost adaptation, our approach distinctively focuses on the internal topology of the adapter itself. By leveraging a multi-branch design during trainin...
-
[5]
EXPERIMENTS 4.1. Datasets and Metrics In our experiments, we focus on biomedical image segmentation across multiple modalities and input resolutions. Specifically, we used the ISBI2012 dataset [1] (2 classes) as a cell microscopy image, and three medical imaging datasets, Kvasir-SEG [11] (2 classes) for endoscopic images, Synapse multi-organ dataset [12...
-
[6]
CONCLUSION In this paper, we presented a prompt-free, PEFT framework that adapts SAM2 for fully automatic biomedical image segmentation. By integrating PEG and replacing the positional encoding method of SAM2, our approach robustly handles the diverse resolutions and aspect ratios inherent in medical imaging without user intervention. Our core contrib...
-
[7]
Segmentation of neuronal structures in em stacks challenge,
“Segmentation of neuronal structures in em stacks challenge,” https://imagej.net/events/isbi-2012-segmentation-challenge, 2012
-
[8]
A Kirillov, E Mintun, N Ravi, H Mao, C Rolland, L Gustafson, T Xiao, S Whitehead, A. C. Berg, W. Y. Lo, P Dollar, and R Girshick, “Segment anything,” in the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026
-
[9]
arXiv preprint arXiv:2309.06824 (2023)
X Lin, Y Xiang, L Zhang, X Yang, Z Yan, and L Yu, “Samus: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmentation,” arXiv preprint arXiv:2309.06824, 2023
-
[10]
U-net: Convolutional networks for biomedical image segmentation,
O Ronneberger, P Fischer, and T Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, vol. 35, pp. 234–241
-
[11]
SAM 2: Segment anything in images and videos,
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer, “SAM 2: Segment anything in images and videos,” in International Confer...
-
[12]
Hiera: A hierarchical vision transformer without the bells-and-whistles,
C Ryali, Y. T. Hu, D Bolya, C Wei, H Fan, P. Y. Huang, V Aggarwal, A Chowdhury, O Poursaeed, J Hoffman, J Malik, Y Li, and C Feichtenhofer, “Hiera: A hierarchical vision transformer without the bells-and-whistles,” in International Conference on Machine Learning. PMLR, 2023, pp. 29441–29454
-
[13]
Generalized sam: Efficient fine-tuning of sam for variable input image sizes,
S Kato, H Mitsuoka, and K Hotta, “Generalized sam: Efficient fine-tuning of sam for variable input image sizes,” in European Conference on Computer Vision, 2024, pp. 167–182
-
[14]
Conditional positional encodings for vision transformers,
X Chu, Z Tian, B Zhang, X Wang, and C Shen, “Conditional positional encodings for vision transformers,” in International Conference on Learning Representations, 2023
-
[15]
Deformable convnets v2: More deformable, better results,
X Zhu, H Hu, S Lin, and J Dai, “Deformable convnets v2: More deformable, better results,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316
-
[16]
Repvgg: Making vgg-style convnets great again,
X Ding, X Zhang, N Ma, J Han, G Ding, and J Sun, “Repvgg: Making vgg-style convnets great again,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733–13742
-
[17]
Kvasir-seg: A segmented polyp dataset,
D Jha, P. H. Smedsrud, M. A. Riegler, P Halvorsen, T De Lange, D Johansen, and H. D. Johansen, “Kvasir-seg: A segmented polyp dataset,” in Multi Media Modeling. Springer, 2020, pp. 451–462
-
[18]
Multi-atlas labeling beyond the cranial vault - workshop and challenge,
“Multi-atlas labeling beyond the cranial vault - workshop and challenge,” https://doi.org/10.7303/syn3193805, 2015
-
[19]
O Bernard, A Lalande, C Zotti, F Cervenansky, X Yang, P. A. Heng, I Cetin, K Lekadir, O Camara, M. A. Gonzalez Ballester, G Sanroma, S Napel, S Petersen, G Tziritas, E Grinias, M Khened, V. A. Kollerathu, G Krishnamurthi, M. M. Rohé, X Pennec, M Sermesant, F Isensee, P Jäger, K. H. Maier-Hein, P. M. Full, I Wolf, S Engelhardt, C. F. Baumgartner...
-
[20]
Lora: Low-rank adaptation of large language models,
E. J. Hu, Y Shen, P Wallis, Z Allen-Zhu, Y Li, S Wang, L Wang, and W Chen, “Lora: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022
-
[21]
Adaptformer: Adapting vision transformers for scalable visual recognition,
S Chen, C Ge, Z Tong, J Wang, Y Song, J Wang, and P Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,” Advances in Neural Information Processing Systems, vol. 35, pp. 16664–16678, 2022
-
[22]
L. C. Chen, G Papandreou, I Kokkinos, K Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017
-
[23]
Convolution meets lora: Parameter efficient finetuning for segment anything model,
Z Zhong, Z Tang, T He, H Fang, and C Yuan, “Convolution meets lora: Parameter efficient finetuning for segment anything model,” in International Conference on Learning Representations, 2024
-
[24]
Gaussian Error Linear Units (GELUs)
D Hendrycks, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016
-
[25]
Golden cudgel network for real-time semantic segmentation,
G Yang, Y Wang, D Shi, and Y Wang, “Golden cudgel network for real-time semantic segmentation,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 25367–25376
-
[26]
Towards efficient visual adaption via structural re-parameterization,
G Luo, M Huang, Y Zhou, X Sun, G Jiang, Z Wang, and R Ji, “Towards efficient visual adaption via structural re-parameterization,” arXiv preprint arXiv:2302.08106, 2023