Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning

Aristeidis Sotiras; Daniel S. Marcus; Jin Yang

arxiv: 2509.10784 · v3 · submitted 2025-09-13 · 📡 eess.IV · cs.CV

Pith reviewed 2026-05-18 17:22 UTC · model grok-4.3

keywords medical image segmentation · active learning · semi-supervised learning · foundation models · volumetric imaging · domain adaptation · pseudo-label selection

The pith

A framework uses active learning and selective semi-supervised fine-tuning to adapt medical vision foundation models to volumetric segmentation under tight label budgets without source data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the ASSFT framework to improve how medical vision foundation models handle volumetric image segmentation in new target domains. It combines an active learning step that picks informative samples via two metrics with a selective semi-supervised step that keeps only reliable pseudo-labels. A sympathetic reader would care because this approach reduces the need for large numbers of new expert annotations while avoiding reliance on the original pre-training data.

Core claim

ASSFT integrates an Active Test-Time Sample Query strategy that measures Diversified Knowledge Divergence to capture both domain gaps and intra-domain variety plus Anatomical Segmentation Difficulty to focus on hard foreground structures, together with a Selective Semi-supervised Fine-tuning strategy that admits only unlabeled samples whose predictive confidence and semantic distance to labeled examples indicate low noise.
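The idea behind Anatomical Segmentation Difficulty, restricting uncertainty to foreground regions, can be illustrated with a small sketch. This is not the paper's definition of ASD, which is not given in this review. It assumes softmax probabilities of shape `(C, D, H, W)`, treats class 0 as background, uses voxel-wise entropy as the uncertainty measure, and takes the model's own predicted foreground as the region of interest. The function names are hypothetical.

```python
import numpy as np

def voxel_entropy(probs, eps=1e-8):
    """Shannon entropy per voxel for probs of shape (C, D, H, W)."""
    return -np.sum(probs * np.log(probs + eps), axis=0)

def foreground_uncertainty(probs):
    """Mean entropy over voxels predicted as foreground.

    Assumption: class 0 is background and the predicted foreground
    mask stands in for the anatomical regions of interest.
    """
    entropy = voxel_entropy(probs)
    foreground = np.argmax(probs, axis=0) != 0
    if not foreground.any():
        return 0.0  # no predicted foreground, nothing to score
    return float(entropy[foreground].mean())

def whole_volume_uncertainty(probs):
    """Mean entropy over every voxel, background included."""
    return float(voxel_entropy(probs).mean())

# Toy volume: confident background everywhere, two uncertain organ voxels.
probs = np.empty((2, 4, 4, 4))
probs[0], probs[1] = 0.99, 0.01
probs[0, 0, 0, :2], probs[1, 0, 0, :2] = 0.4, 0.6

fg_score = foreground_uncertainty(probs)
all_score = whole_volume_uncertainty(probs)
```

In the toy volume the whole-volume average is pulled down by the large, confidently predicted background, while the foreground-restricted score reflects only the two uncertain organ voxels.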

What carries the argument

The ASSFT framework that pairs the Active Test-Time Sample Query (using DKD and ASD metrics) with Selective Semi-supervised Fine-tuning that filters pseudo-labels by confidence and distance to labeled samples.

If this is right

Under small annotation budgets, adaptation performance improves relative to random sample selection or full supervised fine-tuning.
The model generalizes better across different volumetric medical imaging tasks by focusing on previously unlearned anatomical patterns.
Training remains stable because unreliable pseudo-labels are filtered out rather than used in bulk.
No access to source-domain data is required, broadening applicability to privacy-sensitive clinical settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same query metrics might help adapt foundation models in other dense prediction tasks such as registration or detection.
Combining these selection rules with very small numbers of labels could make foundation-model deployment feasible in resource-limited hospitals.
The approach invites direct comparison against other active-learning baselines on public multi-center volumetric datasets to measure robustness.

Load-bearing premise

Predictive confidence combined with semantic distance to labeled samples can reliably exclude noisy pseudo-labels while still supplying stable training signals in the target domain.
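A minimal sketch of such a two-criterion filter follows. The thresholds, the use of Euclidean distance, and the nearest-labeled-neighbor rule are all assumptions chosen for illustration, since the review does not state the paper's actual selection rule. The function name `select_reliable` is hypothetical.

```python
import numpy as np

def select_reliable(confidence, feats_unlabeled, feats_labeled,
                    conf_thresh=0.9, dist_thresh=1.0):
    """Return indices of unlabeled samples that pass both checks.

    confidence:      (N,) mean predictive confidence per unlabeled sample
    feats_unlabeled: (N, F) embeddings of unlabeled samples
    feats_labeled:   (M, F) embeddings of labeled samples
    Thresholds are illustrative assumptions, not values from the paper.
    """
    # Pairwise Euclidean distances, shape (N, M).
    diffs = feats_unlabeled[:, None, :] - feats_labeled[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance from each unlabeled sample to its nearest labeled sample.
    nearest = dists.min(axis=1)
    keep = (confidence >= conf_thresh) & (nearest <= dist_thresh)
    return np.flatnonzero(keep)

confidence = np.array([0.95, 0.95, 0.50])
feats_unlabeled = np.array([[0.0, 0.1], [5.0, 5.0], [0.0, 0.0]])
feats_labeled = np.array([[0.0, 0.0]])
kept = select_reliable(confidence, feats_unlabeled, feats_labeled)
```

Here the second sample is confident but far from any labeled example, and the third is close but unconfident, so only the first is kept. With very few labeled samples, `nearest` depends on a handful of reference points, which is the sparsity concern the premise has to survive.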

What would settle it

A controlled test on a target dataset where samples selected by DKD and ASD produce lower final Dice scores than randomly chosen samples under the same annotation budget.
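The comparison above rests on the Dice score. A short sketch of the standard per-label Dice coefficient on integer label maps is given below; the convention of returning 1.0 when a label is absent from both prediction and ground truth is an assumption, and evaluation protocols differ on this point.

```python
import numpy as np

def dice_score(pred, target, label):
    """Dice coefficient for one label: 2|P ∩ T| / (|P| + |T|)."""
    p = pred == label
    t = target == label
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # assumed convention: label absent from both maps
    return float(2.0 * np.logical_and(p, t).sum() / denom)

def mean_dice(pred, target, labels):
    """Average Dice over the given foreground labels."""
    return float(np.mean([dice_score(pred, target, l) for l in labels]))

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
d = dice_score(pred, target, label=1)
```

Running the same function on models fine-tuned with metric-selected and randomly selected samples, at the same annotation budget, gives the two numbers the test needs to compare.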

Figures

Figures reproduced from arXiv: 2509.10784 by Aristeidis Sotiras, Daniel S. Marcus, Jin Yang.

**Figure 1.** Active Selective Semi-supervised Fine-tuning (ASSFT) of medical vision foundation models for volumetric medical image segmentation. The segmentation network was pre-trained on the source data 𝕊 = {𝑿𝑠} and adapted to the target domain 𝕋 for downstream evaluation. ASSFT employs an Active Test-Time Sample Query strategy to evaluate the information level of each target sample. This strategy employs two metric…

**Figure 2.** Qualitative comparison among results of the medical vision foundation models fine-tuned by (A) 5% and (B) 30% samples from the AMOS2022-CT domain queried by our methods and other SOTA methods. Red boxes mark the regions where our methods exhibit better segmentation results than SOTA methods. Columns: Raw Image, Ground Truth, RAND, ENPY, LCON, BADGE, SANN, UGTST, DKD+ASD, ASFDA. Panels: (A) Query Budget = 5%; (B) Query Budget = 30%.

**Figure 3.** Qualitative comparison among results of the medical vision foundation models fine-tuned by (A) 5% and (B) 30% samples from the AMOS2022-MRI domain queried by our methods and other SOTA methods. Red boxes mark the regions where our methods exhibit better segmentation results than SOTA methods.

**Figure 4.** Qualitative comparison among results of the medical vision foundation models fine-tuned by (A) 5% samples from the FLARE2021 domain and (B) 3-shot from the Abdominal MRI domain queried by our methods and other SOTA methods. Red boxes mark the regions where our methods exhibit better segmentation results than SOTA methods.

**Figure 5.** Comparison of distributions of Dice scores from selected (s) unlabeled samples for the Selective Semi-supervised Fine-tuning and unselected (u) unlabeled samples when adapting Med-VFMs to the AMOS2022-CT domain for five rounds (r1, r2, r3, r4, and r5).

Original abstract

Medical vision foundation models remain limited in downstream tasks, particularly volumetric medical image segmentation. While fine-tuning on labeled target-domain data improves performance, existing approaches typically rely on randomly selected samples, which may fail to identify the most informative data and thus hinder adaptation. To address the limitations, we propose an Active Selective Semi-supervised Fine-tuning framework for efficient adaptation of Med-VFMs to generalize across volumetric medical image segmentation. ASSFT integrates a novel active learning strategy with selective semi-supervised learning to maximize adaptation performance under a limited annotation budget, without requiring access to source data. Specifically, we introduce an Active Test-Time Sample Query strategy that identifies informative samples from the target domain using two complementary query metrics: Diversified Knowledge Divergence and Anatomical Segmentation Difficulty. DKD quantifies both the knowledge gap between pre-training and target domains and the semantic diversity within the target dataset, enabling the selection of samples that contain previously unlearned knowledge while maintaining intra-domain diversity. ASD estimates the segmentation difficulty of target anatomical structures by measuring predictive uncertainty within foreground regions of interest, allowing the model to prioritize samples with complex anatomical patterns rather than those dominated by background uncertainty. Second, we propose a Selective Semi-supervised Fine-tuning strategy to further improve adaptation performance by leveraging unlabeled target samples. Instead of utilizing all pseudo-labeled data, the proposed method selectively incorporates reliable unlabeled samples based on predictive confidence and semantic distance to labeled samples, enabling stable semi-supervised training while avoiding noisy pseudo-labels.
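To make the DKD description more concrete, the sketch below shows one generic way to combine a per-sample divergence score with intra-domain diversity: greedy selection in which each candidate's score is its divergence plus its distance to the nearest already-selected sample. This is a swapped-in, farthest-point-style heuristic, not the paper's DKD formula, which is not reproduced in the abstract. The divergence values are assumed to be precomputed, and `greedy_diverse_query` and `weight` are hypothetical names.

```python
import numpy as np

def greedy_diverse_query(divergence, feats, budget, weight=1.0):
    """Greedily pick samples with high divergence that are also far
    (in feature space) from the samples already picked.

    divergence: (N,) precomputed knowledge-gap score per sample (assumed given)
    feats:      (N, F) feature embedding per sample
    weight:     trade-off between divergence and diversity (assumption)
    """
    n = len(divergence)
    selected = []
    # Distance from each sample to its nearest selected sample so far.
    min_dist = np.full(n, np.inf)
    for _ in range(min(budget, n)):
        score = divergence.astype(float)  # astype returns a fresh copy
        if selected:
            score = score + weight * min_dist
            score[selected] = -np.inf  # never pick the same sample twice
        pick = int(np.argmax(score))
        selected.append(pick)
        dist = np.linalg.norm(feats - feats[pick], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected

divergence = np.array([1.0, 0.9, 0.5])
feats = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0]])
with_diversity = greedy_diverse_query(divergence, feats, budget=2, weight=1.0)
without_diversity = greedy_diverse_query(divergence, feats, budget=2, weight=0.0)
```

With the diversity term switched off, the two near-duplicate samples 0 and 1 are chosen. With it on, the second pick moves to the distant sample 2, which is the redundancy-avoidance behavior the abstract attributes to DKD.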

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ASSFT combines custom active learning metrics with selective pseudo-labeling for efficient Med-VFM adaptation, but the lack of reported results leaves the effectiveness open.

The main thing to know about this paper is that it describes a new framework, ASSFT, for adapting medical vision foundation models to volumetric medical image segmentation. It does this through an active learning strategy that selects informative target samples and a selective semi-supervised fine-tuning step that filters pseudo-labels. What the paper does well is identify a practical bottleneck: random selection of samples for fine-tuning wastes annotation budget, and full access to source data is often impossible. The two query metrics are a reasonable attempt to fix that. Diversified Knowledge Divergence combines domain gap measurement with diversity to pick samples that bring new knowledge without redundancy. Anatomical Segmentation Difficulty shifts focus to uncertainty in the actual anatomical regions, which avoids the common issue where background dominates uncertainty scores in medical volumes. The selective strategy then uses predictive confidence and semantic distance to decide which unlabeled samples to trust for training. This combination feels like a thoughtful integration of existing ideas tailored to the medical volumetric setting.

The soft spots are around validation. The abstract and description outline the intended behavior but do not include any quantitative results, ablation studies, or comparisons to baselines like standard active learning or vanilla fine-tuning. This makes it difficult to assess whether the metrics perform as expected. The concern in the stress-test note is reasonable to raise. With a small initial labeled set, the feature space for semantic distance is sparsely populated, and in volumetric data with small, variable foreground structures, the filter could retain noisy pseudo-labels or exclude valuable ones, leading to unstable training signals. If the full paper has experiments on datasets like CT or MRI volumes, those would need to demonstrate clear improvements and robustness to small label budgets.

This work is aimed at the community working on foundation model adaptation for medical imaging, particularly those interested in reducing annotation costs for new scanners or populations. A reader looking for concrete proposals on query functions and pseudo-label selection could find value in the specific design. I would recommend sending it for peer review. The problem is well-motivated and the framework is described in enough detail to allow proper evaluation, even if revisions will likely be needed around the experimental section.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the ASSFT framework for adapting medical vision foundation models (Med-VFMs) to volumetric medical image segmentation. It combines an Active Test-Time Sample Query strategy—using Diversified Knowledge Divergence (DKD) to measure knowledge gaps and semantic diversity, and Anatomical Segmentation Difficulty (ASD) to prioritize samples with complex foreground patterns—with a Selective Semi-supervised Fine-tuning strategy that filters unlabeled samples by predictive confidence and semantic distance to the labeled set. The goal is to maximize adaptation performance under a limited annotation budget without requiring source-domain data.

Significance. If the proposed query metrics and selective filtering reliably identify informative samples and stable pseudo-labels, the work could reduce annotation costs for domain adaptation in medical imaging, where labeled volumetric data is scarce. The integration of domain-specific heuristics (anatomical uncertainty and knowledge divergence) with semi-supervised selection addresses a practical constraint, though its impact hinges on empirical validation of the heuristics under small labeled pools.

major comments (2)

[Abstract] Abstract, Selective Semi-supervised Fine-tuning strategy: the claim that predictive confidence combined with semantic distance to the (few) labeled samples reliably excludes noisy pseudo-labels is not supported by any analysis or preliminary results. In volumetric data, where foreground structures occupy small fractions and exhibit high anatomical variability, a sparsely sampled feature space can make the distance metric unreliable, either retaining over-confident background noise or discarding distant but useful samples; this directly undermines the premise of stable target-domain training signals without source data.
[Abstract] Abstract, Active Test-Time Sample Query strategy: DKD and ASD are presented as complementary metrics, but no equations, implementation details, or sensitivity analysis are supplied to show how they interact or avoid redundancy (e.g., whether ASD's foreground uncertainty is already captured by DKD's knowledge-gap term). Without such grounding, it is unclear whether the active-learning component actually selects samples that improve adaptation beyond random selection.
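One simple way to probe the redundancy question raised in the second comment is to compare how two query metrics rank the same pool of samples. The sketch below computes a Spearman rank correlation and a top-k overlap between two score vectors. It is a generic diagnostic, not an analysis from the manuscript. The rank computation ignores ties, and both function names are hypothetical.

```python
import numpy as np

def spearman_rank_correlation(a, b):
    """Spearman correlation via Pearson correlation of ranks.

    Assumption: no tied scores (ties would need average ranks).
    """
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def topk_overlap(a, b, k):
    """Fraction of the k highest-scoring samples shared by both metrics."""
    top_a = set(np.argsort(a)[-k:].tolist())
    top_b = set(np.argsort(b)[-k:].tolist())
    return len(top_a & top_b) / k

# Hypothetical per-sample scores from two metrics on the same pool.
metric_one = np.array([0.1, 0.2, 0.3, 0.4])
metric_two_same_order = np.array([1.0, 2.0, 3.0, 4.0])
metric_two_reversed = np.array([4.0, 3.0, 2.0, 1.0])

rho_same = spearman_rank_correlation(metric_one, metric_two_same_order)
rho_rev = spearman_rank_correlation(metric_one, metric_two_reversed)
overlap_same = topk_overlap(metric_one, metric_two_same_order, k=2)
overlap_rev = topk_overlap(metric_one, metric_two_reversed, k=2)
```

A correlation near 1 with high top-k overlap would suggest the two metrics select largely the same samples. Low values would be consistent with, though not proof of, the claimed complementarity.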

minor comments (1)

[Abstract] The abstract repeatedly uses the acronym ASSFT before defining it; a single introductory sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We appreciate the opportunity to address the concerns regarding the support for our claims in the ASSFT framework. Below, we provide point-by-point responses to the major comments. We will revise the manuscript to incorporate additional clarifications, analyses, and details as outlined.

Referee: [Abstract] Abstract, Selective Semi-supervised Fine-tuning strategy: the claim that predictive confidence combined with semantic distance to the (few) labeled samples reliably excludes noisy pseudo-labels is not supported by any analysis or preliminary results. In volumetric data, where foreground structures occupy small fractions and exhibit high anatomical variability, a sparsely sampled feature space can make the distance metric unreliable, either retaining over-confident background noise or discarding distant but useful samples; this directly undermines the premise of stable target-domain training signals without source data.

Authors: We acknowledge that the abstract provides a high-level summary without including preliminary results or detailed analysis of the filtering mechanism. The full manuscript presents experimental results demonstrating improved adaptation performance when using the selective strategy compared to using all pseudo-labels. However, to directly address the referee's valid concern about potential unreliability of the semantic distance metric in volumetric settings with sparse foregrounds and limited labels, we will add a dedicated preliminary analysis and ablation study in the revised version. This will include quantitative evaluation of how the combined confidence and distance criteria affect pseudo-label quality under small labeled pools, along with visualizations of selected samples. revision: yes
Referee: [Abstract] Abstract, Active Test-Time Sample Query strategy: DKD and ASD are presented as complementary metrics, but no equations, implementation details, or sensitivity analysis are supplied to show how they interact or avoid redundancy (e.g., whether ASD's foreground uncertainty is already captured by DKD's knowledge-gap term). Without such grounding, it is unclear whether the active-learning component actually selects samples that improve adaptation beyond random selection.

Authors: The abstract summarizes the two metrics at a high level, while the full manuscript provides the mathematical definitions, implementation details, and algorithmic steps for both DKD and ASD in the Methods section. DKD measures knowledge divergence from the pre-trained model combined with feature diversity in the target domain, whereas ASD specifically computes uncertainty restricted to foreground anatomical regions, which is distinct from the broader knowledge-gap term in DKD. To strengthen the demonstration of complementarity and improvement over random selection, we will include a sensitivity analysis and additional ablation experiments in the revised manuscript, quantifying the performance gain of the combined query strategy versus random sampling and individual metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: ASSFT relies on independently defined heuristics

The paper proposes two algorithmic components—an Active Test-Time Sample Query using Diversified Knowledge Divergence (DKD) and Anatomical Segmentation Difficulty (ASD), plus a Selective Semi-supervised Fine-tuning filter based on predictive confidence and semantic distance—without any equations, parameters, or claims that reduce to their own inputs by construction. These metrics are motivated directly by volumetric medical imaging properties (foreground sparsity, anatomical variability) and are presented as novel selection criteria rather than derived results. No self-citation chains, fitted inputs renamed as predictions, or self-definitional loops appear in the described derivation; the central claim of improved adaptation under limited labels rests on the empirical behavior of these heuristics, which remain falsifiable on external target-domain data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on unverified assumptions that the introduced metrics correctly identify informative and difficult samples and that confidence-based filtering removes noise without discarding useful signal; these are not supported by external benchmarks in the abstract.

axioms (2)

domain assumption: Predictive uncertainty within foreground regions reliably indicates anatomical segmentation difficulty.
This underpins the ASD query metric for prioritizing complex anatomical patterns.
domain assumption: Semantic distance and predictive confidence can separate reliable from noisy pseudo-labels in the target domain.
This underpins the selective incorporation rule in the semi-supervised stage.
