arxiv: 2605.05979 · v1 · submitted 2026-05-07 · 💻 cs.CV

Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters

Pith reviewed 2026-05-09 15:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords SAM2 adaptation · biomedical semantic segmentation · parameter-efficient fine-tuning · dual adapters · deformable convolutions · prompt-free · medical image analysis · re-parameterization

The pith

Dual-adapter fine-tuning adapts SAM2 for prompt-free biomedical segmentation with major efficiency gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a fine-tuning approach for the Segment Anything Model 2 that removes the need for user prompts and sharply reduces computation when applied to medical images. It adds a convolutional generator to create positional encodings that work with images of any aspect ratio and introduces two adapters: one that uses deformable convolutions to trace object boundaries accurately and another that re-parameterizes layers to speed up inference. Experiments on four biomedical datasets show accuracy rising by as much as 19.66 percent over the original SAM2 while cutting compute by roughly 87 percent relative to earlier heavy medical adaptations of the model. Readers would care because this combination makes a strong general-purpose segmentation model practical for clinical tasks where prompts are impractical and hardware resources are limited.

Core claim

The authors claim that a prompt-free, parameter-efficient fine-tuning framework enables effective multi-class semantic segmentation on variable-sized biomedical inputs. The framework combines a convolutional Positional Encoding Generator for arbitrary aspect ratios with a dual-adapter strategy: a High-Performance Adapter that uses deformable convolutions for precise boundary modeling, and a Lightweight Adapter that uses structural re-parameterization to minimize inference latency. On the ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets, it is reported to be significantly more accurate than vanilla SAM2 and far cheaper to run than prior heavyweight medical adaptations.

What carries the argument

A dual-adapter strategy, in which a High-Performance Adapter uses deformable convolutions for boundary modeling and a Lightweight Adapter uses structural re-parameterization for low latency, together with a convolutional Positional Encoding Generator that handles variable aspect ratios.
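
As a hedged illustration of the structural re-parameterization idea behind the Lightweight Adapter, the sketch below fuses a RepVGG-style three-branch block (3×3 convolution, 1×1 convolution, identity) into one 3×3 kernel and checks that both forms give the same output. It is single-channel, omits batch-norm folding, and uses hypothetical function names and weights; the paper's actual branch topology is not given in this summary.

```python
def conv3x3_same(x, k):
    """3x3 cross-correlation with zero padding; x is H x W, k is 3 x 3."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += k[di + 1][dj + 1] * x[ii][jj]
    return out


def multi_branch(x, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch (scalar k1) + identity."""
    y3 = conv3x3_same(x, k3)
    return [[y3[i][j] + k1 * x[i][j] + x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]


def fuse_branches(k3, k1):
    """Inference-time form: fold the 1x1 and identity branches into the centre tap."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


# Hypothetical input and weights, chosen only for the equivalence check.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, -2.0, 1.5]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.4, -0.3]]
k1 = 0.25

train_out = multi_branch(x, k3, k1)
infer_out = conv3x3_same(x, fuse_branches(k3, k1))
max_gap = max(abs(a - b)
              for row_a, row_b in zip(train_out, infer_out)
              for a, b in zip(row_a, row_b))
print("largest difference between the two forms:", max_gap)
```

Because every branch is linear, the fused kernel reproduces the multi-branch output up to floating-point rounding, which is why the extra training-time branches add no cost at inference.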

If this is right

  • The method supports multi-class segmentation on biomedical images of arbitrary sizes without manual prompts.
  • It delivers up to 19.66 percent higher accuracy than vanilla SAM2 across the tested datasets.
  • Computational costs drop by approximately 87 percent compared with heavyweight medical SAM adaptations.
  • The re-parameterized lightweight adapter enables lower-latency inference suitable for clinical use.
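
The two headline percentages can be read in more than one way, and this summary does not say whether 19.66 is an absolute (percentage-point) or a relative gain. The sketch below, using hypothetical numbers and helper names that are not from the paper, shows how each reading would be computed and what an 87 percent cost reduction implies as a cost ratio.

```python
def relative_reduction(baseline_cost, new_cost):
    """Fraction of the baseline cost that has been removed."""
    return (baseline_cost - new_cost) / baseline_cost


def absolute_gain(baseline_score, new_score):
    """Gain in score points (e.g. percentage points of Dice or mIoU)."""
    return new_score - baseline_score


def relative_gain(baseline_score, new_score):
    """Gain expressed as a fraction of the baseline score."""
    return (new_score - baseline_score) / baseline_score


# Hypothetical values, NOT taken from the paper's tables.
baseline_cost, new_cost = 100.0, 13.0      # arbitrary compute units
baseline_score, new_score = 60.0, 79.66    # score in percent

reduction = relative_reduction(baseline_cost, new_cost)   # 87 percent of cost removed
cost_ratio = baseline_cost / new_cost                     # baseline is ~7.7x more expensive
points = absolute_gain(baseline_score, new_score)         # gain in percentage points
ratio = relative_gain(baseline_score, new_score)          # gain relative to the baseline

print(f"reduction={reduction:.2%} cost_ratio={cost_ratio:.2f}x")
print(f"absolute gain={points:.2f} points, relative gain={ratio:.2%}")
```

With these made-up scores the same improvement is 19.66 points in absolute terms but roughly a third in relative terms, so the reading matters when comparing against other papers.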

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same dual-adapter pattern could transfer to adapting other vision foundation models to specialized domains such as satellite or industrial imagery.
  • Reduced compute requirements may allow SAM2-based segmentation on standard hospital workstations without dedicated high-end GPUs.
  • Combining the adapters with further efficiency techniques like quantization could yield additional speed improvements for real-time applications.

Load-bearing premise

The dual-adapter design with deformable convolutions and re-parameterization will maintain its accuracy-efficiency trade-off when applied to new biomedical datasets or tasks beyond the four evaluated.

What would settle it

Testing the adapted model on an additional biomedical dataset such as brain tumor segmentation and checking whether the reported accuracy gain over vanilla SAM2 and the 87 percent computational reduction are preserved.

Figures

Figures reproduced from arXiv: 2605.05979 by Hinako Mitsuoka, Kazuhiro Hotta.

Figure 1. We freeze the prompt encoder and extend the mask de…
original abstract

Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, parameter-efficient fine-tuning framework designed for multi-class segmentation on variable-sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual-adapter strategy: High-Performance Adapter utilizing deformable convolutions for precise boundary modeling and Lightweight Adapter employing structural re-parameterization to minimize inference latency. Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets demonstrate that our approach significantly outperforms strong adaptation baselines. Specifically, our method improved segmentation accuracy by up to 19.66% over the vanilla SAM2, while reducing computational costs by approximately 87% compared to heavyweight medical SAM adaptations, establishing a superior trade-off between accuracy and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a prompt-free, parameter-efficient fine-tuning framework for adapting SAM2 to multi-class biomedical semantic segmentation on variable-sized inputs. It introduces a convolutional Positional Encoding Generator and a dual-adapter strategy consisting of a High-Performance Adapter (deformable convolutions for boundary modeling) and a Lightweight Adapter (structural re-parameterization for reduced inference latency). Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC report up to 19.66% accuracy gains over vanilla SAM2 and ~87% compute reduction versus prior heavyweight medical SAM adaptations.

Significance. If the empirical claims hold with fuller validation, the work provides a practical accuracy-efficiency trade-off for adapting large foundation models like SAM2 to biomedical domains without prompts or heavy compute, leveraging public benchmarks for reproducibility. The dual-adapter design and re-parameterization approach could inform efficient adaptation strategies in medical imaging if generalization is demonstrated.

major comments (3)
  1. [§4 and Abstract] §4 (Experiments) and Abstract: The headline quantitative claims (19.66% accuracy lift and 87% compute reduction) are presented without statistical significance tests, standard deviations across multiple runs, or explicit hyperparameter search details; this undermines confidence in the superiority over baselines given potential selection effects or data leakage risks on the four public datasets.
  2. [§3.2 and §4.3] §3.2 (Dual-Adapter Strategy) and §4.3 (Efficiency Analysis): The claim that the High-Performance Adapter (deformable convolutions) plus Lightweight Adapter simultaneously delivers both the accuracy gain and the inference speedup is load-bearing, yet no ablation isolates the contribution of each adapter component, and no results are shown for an unseen modality (e.g., ultrasound) to test whether the deformable-convolution overfitting risk materializes.
  3. [Table 2 / §4] Table 2 or equivalent comparison table in §4: The 87% computational-cost reduction is asserted relative to 'heavyweight medical SAM adaptations,' but the table does not report the exact parameter counts, FLOPs, or inference latencies of those specific baselines, preventing independent verification of the efficiency claim.
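
The parameter counts and FLOPs requested in major comment 3 follow from standard formulas for convolution layers. A minimal sketch is given below; the helper names and layer sizes are hypothetical, and papers differ on whether one multiply-accumulate (MAC) is counted as one or two FLOPs, so the convention should be stated alongside any reported number.

```python
def conv2d_params(c_in, c_out, k, groups=1, bias=True):
    """Learnable parameters of a k x k 2D convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    weights = c_out * (c_in // groups) * k * k
    return weights + (c_out if bias else 0)


def conv2d_macs(c_in, c_out, k, h_out, w_out, groups=1):
    """Multiply-accumulates for one forward pass (bias adds ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return c_out * (c_in // groups) * k * k * h_out * w_out


# Hypothetical layer sizes, for illustration only.
dense_params = conv2d_params(64, 64, 3)                            # standard 3x3 conv
dense_macs = conv2d_macs(64, 64, 3, 32, 32)                        # on a 32 x 32 output map
depthwise_params = conv2d_params(64, 64, 3, groups=64, bias=False) # depthwise 3x3 conv

print(dense_params, dense_macs, depthwise_params)
```

Reporting these quantities per baseline, together with measured latency on a stated device, would let a reader recompute the claimed reduction directly.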
minor comments (2)
  1. [§3.1] The description of the convolutional Positional Encoding Generator in §3.1 would benefit from an explicit equation or pseudocode showing how it handles arbitrary aspect ratios.
  2. [Figure 3] Figure 3 (architecture diagram) could include a side-by-side latency/accuracy Pareto plot to visually support the claimed trade-off.
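
In the spirit of minor comment 1, the sketch below shows one common form of a convolutional positional encoding generator, following the conditional positional encoding idea the paper cites: a depthwise 3×3 convolution with zero padding applied to the feature map and added back residually. This is an assumption about the general technique, not the authors' exact module, and the function name and weights are hypothetical.

```python
def peg(features, kernels):
    """Depthwise 3x3 conv with zero padding plus a residual connection.

    features: list of C channels, each an H x W list of floats.
    kernels:  list of C kernels, each 3 x 3 (one kernel per channel).
    Works for any H and W because the kernel slides over whatever grid it is given.
    """
    out = []
    for chan, k in zip(features, kernels):
        h, w = len(chan), len(chan[0])
        enc = [[chan[i][j] for j in range(w)] for i in range(h)]  # residual path
        for i in range(h):
            for j in range(w):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding outside the map
                            enc[i][j] += k[di + 1][dj + 1] * chan[ii][jj]
        out.append(enc)
    return out


# One channel of ones with an all-ones kernel, at two different aspect ratios.
ones_kernel = [[[1.0] * 3 for _ in range(3)]]
wide = peg([[[1.0] * 5 for _ in range(3)]], ones_kernel)   # 3 x 5 input
tall = peg([[[1.0] * 2 for _ in range(6)]], ones_kernel)   # 6 x 2 input
print(len(wide[0]), len(wide[0][0]), len(tall[0]), len(tall[0][0]))
```

No fixed-size position table is involved, so the output always has the input's shape. With constant input, border locations receive different values from interior ones because of the zero padding, which is one way such a layer can carry positional information.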

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive review of our manuscript. We address each major comment point by point below, committing to revisions that strengthen the empirical rigor and transparency of the work where feasible.

point-by-point responses
  1. Referee: [§4 and Abstract] §4 (Experiments) and Abstract: The headline quantitative claims (19.66% accuracy lift and 87% compute reduction) are presented without statistical significance tests, standard deviations across multiple runs, or explicit hyperparameter search details; this undermines confidence in the superiority over baselines given potential selection effects or data leakage risks on the four public datasets.

    Authors: We appreciate this observation on the need for greater statistical robustness. In the revised manuscript, we will rerun the key experiments across multiple random seeds to report mean performance with standard deviations. We will also expand the experimental details to include the hyperparameter search procedure and add statistical significance tests (such as paired t-tests against baselines) for the reported accuracy improvements. These additions will mitigate concerns about variability and selection effects. revision: yes

  2. Referee: [§3.2 and §4.3] §3.2 (Dual-Adapter Strategy) and §4.3 (Efficiency Analysis): The claim that the High-Performance Adapter (deformable convolutions) plus Lightweight Adapter simultaneously delivers both the accuracy gain and the inference speedup is load-bearing, yet no ablation isolates the contribution of each adapter component, and no results are shown for an unseen modality (e.g., ultrasound) to test whether the deformable-convolution overfitting risk materializes.

    Authors: We agree that component-wise ablations are necessary to substantiate the dual-adapter design. The revised version will include new ablation tables isolating the High-Performance Adapter (deformable convolutions), the Lightweight Adapter (re-parameterization), and their joint use, quantifying effects on both accuracy and inference latency. Our current benchmarks already span four distinct modalities (electron microscopy, endoscopy, CT, and MRI), which provides evidence against severe overfitting to a single domain. However, we lack ultrasound data and cannot perform new experiments on it without additional data acquisition outside the scope of this work; we will explicitly discuss this as a limitation and the associated overfitting risk in the revision. revision: partial

  3. Referee: [Table 2 / §4] Table 2 or equivalent comparison table in §4: The 87% computational-cost reduction is asserted relative to 'heavyweight medical SAM adaptations,' but the table does not report the exact parameter counts, FLOPs, or inference latencies of those specific baselines, preventing independent verification of the efficiency claim.

    Authors: We acknowledge the need for full transparency in the efficiency comparison. In the revised manuscript, we will update the relevant table to report the precise parameter counts, FLOPs, and measured inference latencies for each heavyweight medical SAM adaptation baseline (drawn from their original papers or consistent re-implementations). This will enable direct verification of the reported ~87% compute reduction. revision: yes
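
The multi-seed reporting and paired t-tests promised in response 1 could be computed along the following lines. This is a minimal standard-library sketch with hypothetical per-seed scores and a hypothetical function name; it returns the t statistic and degrees of freedom only, and the p-value would still need a t-distribution lookup (for example via `scipy.stats.ttest_rel`, if SciPy is available).

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired t statistic for matched runs (same seeds, same data splits).

    Returns (mean difference, sample std of differences, t statistic, dof).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_d / (sd_d / math.sqrt(n))
    return mean_d, sd_d, t_stat, n - 1


# Hypothetical Dice scores over five seeds, NOT results from the paper.
adapted = [0.80, 0.82, 0.81, 0.83, 0.79]
baseline = [0.78, 0.79, 0.80, 0.80, 0.78]

mean_d, sd_d, t_stat, dof = paired_t(adapted, baseline)
print(f"mean diff={mean_d:.3f} sd={sd_d:.3f} t={t_stat:.3f} dof={dof}")
```

Pairing by seed removes run-to-run variance shared by both methods, which is why it is usually preferred over an unpaired test when the same seeds and splits are reused.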

standing simulated objections not resolved
  • Reporting results on an additional unseen modality such as ultrasound, since suitable datasets were not part of the current study and new data collection would be required.

Circularity Check

0 steps flagged

No circularity in empirical adaptation claims

full rationale

The paper is an empirical adaptation study that introduces a dual-adapter fine-tuning framework for SAM2 and validates performance via experiments on four public biomedical datasets. No mathematical derivation chain, equations, or predictions are presented that reduce claimed improvements to quantities defined solely by parameters fitted within the paper itself. The accuracy and efficiency results are reported as experimental outcomes rather than self-referential constructs, and no load-bearing self-citations or uniqueness theorems are invoked to justify the core method.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The approach rests on standard deep-learning assumptions about transferability of SAM2 features plus several learned components whose sizes and behaviors are determined during fine-tuning on the target datasets.

free parameters (3)
  • Deformable convolution offset and modulation parameters
    Learned during training of the high-performance adapter to model boundaries.
  • Re-parameterization scaling factors
    Chosen to trade off accuracy against inference speed in the lightweight adapter.
  • Adapter hidden dimensions and ranks
    Typical hyper-parameters in parameter-efficient fine-tuning that control capacity.
axioms (1)
  • domain assumption: SAM2 features remain useful after domain shift when augmented by small adapter modules
    Core premise enabling the prompt-free adaptation strategy.
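
To make the first free parameter concrete, the sketch below evaluates a modulated deformable 3×3 convolution (in the style of Deformable ConvNets v2, which the paper cites) at a single output location. Each kernel tap is shifted by a learned fractional offset, sampled bilinearly, and scaled by a learned modulation weight. The function names, the single-channel setting, and all numeric values are illustrative assumptions, not the authors' implementation.

```python
import math


def bilinear(x, r, c):
    """Sample the 2D list x at fractional (row, col); outside the map counts as 0."""
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < len(x) and 0 <= cc < len(x[0]):
                value += (1 - abs(r - rr)) * (1 - abs(c - cc)) * x[rr][cc]
    return value


def deform_conv_at(x, weights, offsets, masks, r, c):
    """Modulated deformable 3x3 conv output at location (r, c).

    weights: 9 kernel weights in row-major order.
    offsets: 9 (d_row, d_col) learned offsets, one per tap.
    masks:   9 modulation scalars, typically in [0, 1].
    """
    out, idx = 0.0, 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            off_r, off_c = offsets[idx]
            sample = bilinear(x, r + dr + off_r, c + dc + off_c)
            out += weights[idx] * masks[idx] * sample
            idx += 1
    return out


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
w = [1.0] * 9

# Zero offsets and unit masks reduce to an ordinary 3x3 convolution.
plain = deform_conv_at(x, w, [(0.0, 0.0)] * 9, [1.0] * 9, 1, 1)
# Shifting every tap half a pixel to the right changes what is sampled.
shifted = deform_conv_at(x, w, [(0.0, 0.5)] * 9, [1.0] * 9, 1, 1)
print(plain, shifted)
```

In a full layer the offsets and masks are themselves predicted by a small convolution over the input, so the sampling grid can bend toward object boundaries; that flexibility is also the source of the overfitting risk the referee raises.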

pith-pipeline@v0.9.0 · 5459 in / 1210 out tokens · 27949 ms · 2026-05-09T15:29:21.400399+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

  1. [1]

    Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters

    INTRODUCTION Biomedical semantic segmentation is essential for computer-aided diagnosis and quantitative analysis. Although specialized models for semantic segmentation, such as U-Net [4], perform well, they are typically trained from scratch per dataset. Recently, foundation models such as SAM [2] and SAM2 [5] have shown strong generalization via larg...

  2. [2]

    RELATED WORKS 2.1. Parameter Efficient Fine-Tuning (PEFT) Parameter Efficient Fine-Tuning (PEFT) adapts large pre-trained models by updating only a small subset of parameters while keeping most weights frozen, reducing training cost and storage. Representative approaches include low-rank adaptation, such as LoRA [14] and adapter-based tuning that inserts...

  3. [3]

    METHODOLOGY Fig. 2 illustrates the overview of our method. During fine-tuning, we keep the original SAM2 components (Image Encoder, Prompt Encoder, and Mask Decoder) frozen, and update only the newly introduced modules marked as Learnable. 3.1. Prompt-Free Multi-Class Segmentation With SAM2 When still images are fed into SAM2, SAM2 follows the SAM-styl...

  4. [4]

    to mathematically fuse these branches into a single equivalent 3×3 convolution. Although recent methods such as RepAdapter [20] utilize re-parameterization to merge adapters into the backbone weights for zero-cost adaptation, our approach distinctively focuses on the internal topology of the adapter itself. By leveraging a multi-branch design during trainin...

  5. [5]

    Datasets and Metrics In our experiments, we focus on biomedical image segmentation across multiple modalities and input resolutions

    EXPERIMENTS 4.1. Datasets and Metrics In our experiments, we focus on biomedical image segmentation across multiple modalities and input resolutions. Specifically, we used the ISBI2012 dataset [1] (2 classes) as a cell microscopy image, and three medical imaging datasets, Kvasir-SEG [11] (2 classes) for endoscopic images, Synapse multi-organ dataset [12...

  6. [6]

    CONCLUSION In this paper, we presented a prompt-free, PEFT framework that adapts SAM2 for fully automatic biomedical image segmentation. By integrating PEG and replacing the positional encoding method of SAM2, our approach robustly handles the diverse resolutions and aspect ratios inherent in medical imaging without user intervention. Our core contrib...

  7. [7]

    Segmentation of neuronal structures in em stacks challenge,

    “Segmentation of neuronal structures in em stacks challenge,” https://imagej.net/events/isbi-2012-segmentation-challenge, 2012

  8. [8]

    Segment anything,

    A Kirillov, E Mintun, N Ravi, H Mao, C Rolland, L Gustafson, T Xiao, S Whitehead, A. C. Berg, W. Y. Lo, P Dollar, and R Girshick, “Segment anything,” in the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026

  9. [9]

    Samus: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmentation,

    X Lin, Y Xiang, L Zhang, X Yang, Z Yan, and L Yu, “Samus: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmentation,” arXiv preprint arXiv:2309.06824, 2023

  10. [10]

    U-net: Convolutional networks for biomedical image segmentation,

    O Ronneberger, P Fischer, and T Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, vol. 35, pp. 234–241

  11. [11]

    SAM 2: Segment anything in images and videos,

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer, “SAM 2: Segment anything in images and videos,” in International Confer...

  12. [12]

    Hiera: A hierarchical vision transformer without the bells-and-whistles,

    C Ryali, Y. T. Hu, D Bolya, C Wei, H Fan, P. Y. Huang, V Aggarwal, A Chowdhury, O Poursaeed, J Hoffman, J Malik, Y Li, and C Feichtenhofer, “Hiera: A hierarchical vision transformer without the bells-and-whistles,” in International Conference on Machine Learning. PMLR, 2023, pp. 29441–29454

  13. [13]

    Generalized sam: Efficient fine-tuning of sam for variable input image sizes,

    S Kato, H Mitsuoka, and K Hotta, “Generalized sam: Efficient fine-tuning of sam for variable input image sizes,” in European Conference on Computer Vision, 2024, pp. 167–182

  14. [14]

    Conditional positional encodings for vision transformers,

    X Chu, Z Tian, B Zhang, X Wang, and C Shen, “Conditional positional encodings for vision transformers,” in International Conference on Learning Representations, 2023

  15. [15]

    Deformable convnets v2: More deformable, better results,

    X Zhu, H Hu, S Lin, and J Dai, “Deformable convnets v2: More deformable, better results,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316

  16. [16]

    Repvgg: Making vgg-style convnets great again,

    X Ding, X Zhang, N Ma, J Han, G Ding, and J Sun, “Repvgg: Making vgg-style convnets great again,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733–13742

  17. [17]

    Kvasir-seg: A segmented polyp dataset,

    D Jha, P. H. Smedsrud, M. A. Riegler, P Halvorsen, T De Lange, D Johansen, and H. D. Johansen, “Kvasir-seg: A segmented polyp dataset,” in Multi Media Modeling. Springer, 2020, pp. 451–462

  18. [18]

    Multi-atlas labeling beyond the cranial vault - workshop and challenge,

    “Multi-atlas labeling beyond the cranial vault - workshop and challenge,” https://doi.org/10.7303/syn3193805, 2015

  19. [19]

    Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: Is the problem solved?,

    O Bernard, A Lalande, C Zotti, F Cervenansky, X Yang, P. A. Heng, I Cetin, K Lekadir, O Camara, M. A. Gonzalez Ballester, G Sanroma, S Napel, S Petersen, G Tziritas, E Grinias, M Khened, V. A. Kollerathu, G Krishnamurthi, M. M. Rohé, X Pennec, M Sermesant, F Isensee, P Jäger, K. H. Maier-Hein, P. M. Full, I Wolf, S Engelhardt, C. F. Baumgartner...

  20. [20]

    Lora: Low-rank adaptation of large language models,

    E. J. Hu, Y Shen, P Wallis, Z Allen-Zhu, Y Li, S Wang, L Wang, and W Chen, “Lora: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022

  21. [21]

    Adaptformer: Adapting vision transformers for scalable visual recognition,

    S Chen, C Ge, Z Tong, J Wang, Y Song, J Wang, and P Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,” Advances in Neural Information Processing Systems, vol. 35, pp. 16664–16678, 2022

  22. [22]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

    L. C. Chen, G Papandreou, I Kokkinos, K Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017

  23. [23]

    Convolution meets lora: Parameter efficient finetuning for segment anything model,

    Z Zhong, Z Tang, T He, H Fang, and C Yuan, “Convolution meets lora: Parameter efficient finetuning for segment anything model,” in International Conference on Learning Representations, 2024

  24. [24]

    Gaussian Error Linear Units (GELUs)

    D Hendrycks, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016

  25. [25]

    Golden cudgel network for real-time semantic segmentation,

    G Yang, Y Wang, D Shi, and Y Wang, “Golden cudgel network for real-time semantic segmentation,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 25367–25376

  26. [26]

    Towards efficient visual adaption via structural re-parameterization,

    G Luo, M Huang, Y Zhou, X Sun, G Jiang, Z Wang, and R Ji, “Towards efficient visual adaption via structural re-parameterization,” arXiv preprint arXiv:2302.08106, 2023