SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation?
Pith reviewed 2026-05-08 04:36 UTC · model grok-4.3
The pith
SemiSAM-O1 shows that medical image segmentation with only one annotated template can approach full supervision performance by using foundation model features for label propagation and iterative refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that SemiSAM-O1 significantly narrows the performance gap to full supervision, while lowering online inference overhead, across diverse modalities and targets. It does so in two steps. First, a foundation model's encoder extracts dense features from all volumes, and class prototypes derived from the one annotated template are propagated by feature similarity to generate initial pseudo-labels. Second, an iterative loop trains the model from scratch on the current pseudo-labels and refines its predictions with uncertainty-guided aggregation in the foundation model's feature space.
What carries the argument
A two-stage framework: feature-similarity prototype propagation produces the initial pseudo-labels, and an iterative training-and-refinement loop then uses voxel-wise uncertainty to correct unreliable voxels via neighbor aggregation in the foundation model's feature space.
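The first stage can be illustrated with a minimal sketch. It assumes that each class prototype is the mean of the template's features for that class and that voxels are assigned to the prototype with the highest cosine similarity. The abstract does not spell out either detail, so the function names (`class_prototypes`, `propagate`) and the toy 2-D features are hypothetical, and plain Python lists stand in for real encoder features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def class_prototypes(template_feats, template_labels):
    """Average the template's features per class (assumed prototype rule)."""
    sums, counts = {}, {}
    for feat, label in zip(template_feats, template_labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {c: [x / counts[c] for x in acc] for c, acc in sums.items()}

def propagate(prototypes, unlabeled_feats):
    """Give each unlabeled voxel the class of its most similar prototype."""
    return [
        max(prototypes, key=lambda c: cosine(prototypes[c], feat))
        for feat in unlabeled_feats
    ]

# Toy features (illustrative only): class 0 = background, class 1 = target.
template_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
template_labels = [0, 0, 1, 1]
protos = class_prototypes(template_feats, template_labels)
pseudo = propagate(protos, [[0.8, 0.2], [0.2, 0.7], [1.0, 0.05]])
print(pseudo)  # → [0, 1, 0]
```

Because the prototypes come from a single template, any anatomy that the template does not represent well has no prototype to match, which is why the quality of these coarse labels matters for everything downstream.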
If this is right
- Segmentation models trained this way achieve Dice scores much closer to fully supervised ones than previous one-label methods.
- Computational cost drops because foundation-model features are extracted once, offline, instead of the foundation model being queried online throughout training.
- The method works robustly on complex modalities where earlier foundation model-driven SSL approaches underperformed.
- Multiple refinement rounds create a cycle where better models produce better pseudo-labels and vice versa.
Where Pith is reading between the lines
- The reliance on dense features implies that foundation models trained on general data can transfer semantic understanding effectively to specialized medical domains without task-specific fine-tuning.
- Extending the uncertainty-guided correction to other semi-supervised setups could improve label quality in low-data regimes beyond medical imaging.
- Testing the method on datasets with domain shifts, such as different scanners or patient populations, would show if the feature similarity holds across variations.
Load-bearing premise
The dense features extracted by the foundation model must contain enough semantic information to accurately propagate labels from one template via similarity, and the uncertainty estimates from the trained model must correctly highlight areas for reliable correction by neighbor labels.
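The second half of this premise can be made concrete with a minimal sketch of uncertainty-guided neighbor aggregation. The uncertainty values are taken as given inputs. The threshold `tau`, the neighbor count `k`, and the majority-vote rule are assumptions chosen for illustration, not details confirmed by the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def refine(feats, labels, uncertainty, tau=0.5, k=3):
    """Relabel uncertain voxels by majority vote of their k most similar
    confident neighbors in feature space (assumed aggregation rule)."""
    confident = [i for i, u in enumerate(uncertainty) if u <= tau]
    refined = list(labels)
    if not confident:
        return refined  # nothing reliable to borrow labels from
    for i, u in enumerate(uncertainty):
        if u <= tau:
            continue  # confident voxels keep their current label
        neighbors = sorted(
            confident, key=lambda j: cosine(feats[i], feats[j]), reverse=True
        )[:k]
        votes = Counter(labels[j] for j in neighbors)
        refined[i] = votes.most_common(1)[0][0]
    return refined

# Toy case: the last voxel is labeled 1 with high uncertainty, but its
# feature vector sits next to the confident class-0 voxels.
feats = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9], [0.85, 0.2]]
labels = [0, 0, 0, 1, 1, 1]
uncertainty = [0.1, 0.1, 0.2, 0.1, 0.1, 0.9]
print(refine(feats, labels, uncertainty))  # → [0, 0, 0, 1, 1, 0]
```

The sketch also shows where the premise can fail: if the uncertainty estimates mark a wrong voxel as confident, that voxel is never corrected and can vote on its neighbors.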
What would settle it
If experiments on additional complex modalities show that after several refinement iterations the segmentation accuracy stays far below fully supervised levels, or if uncertainty guidance introduces more errors than it corrects, the central claim would be falsified.
Original abstract
Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model's feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model's encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model's global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.
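The control flow of the second stage, as the abstract describes it, can be sketched as a short loop. The three callables (`train_from_scratch`, `predict_with_uncertainty`, `refine`) are hypothetical placeholders for components the abstract names but does not specify, and the toy stand-ins below only exercise the loop structure.

```python
def iterative_refinement(pseudo_labels, train_from_scratch,
                         predict_with_uncertainty, refine, rounds=3):
    """Run the train -> predict -> refine cycle for a fixed number of rounds."""
    model = None
    history = [list(pseudo_labels)]
    for _ in range(rounds):
        # A fresh model each round, so earlier fitting errors are not inherited.
        model = train_from_scratch(pseudo_labels)
        preds, uncertainty = predict_with_uncertainty(model)
        # The refined predictions become the pseudo-labels for the next round.
        pseudo_labels = refine(preds, uncertainty)
        history.append(list(pseudo_labels))
    return model, pseudo_labels, history

# Toy stand-ins (hypothetical): a "model" that memorizes its training labels.
calls = []

def toy_train(labels):
    calls.append(list(labels))
    return {"memory": list(labels)}

def toy_predict(model):
    preds = model["memory"]
    return preds, [0.0] * len(preds)

def toy_refine(preds, uncertainty):
    return list(preds)

model, final_labels, history = iterative_refinement(
    [0, 1, 1], toy_train, toy_predict, toy_refine, rounds=2
)
print(len(history), len(calls))  # → 3 2
```

Keeping a per-round `history` is a design choice of this sketch: it makes it easy to compare initial and final pseudo-labels.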
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SemiSAM-O1, an annotation-efficient SSL framework for medical image segmentation that requires only a single annotated template. Stage 1 extracts dense features once from a foundation model encoder and propagates class prototypes via cosine similarity to generate initial pseudo-labels for the unlabeled pool. Stage 2 runs an iterative loop that trains a segmentation model from scratch on current pseudo-labels, computes voxel-wise uncertainty, and refines high-uncertainty regions by aggregating labels from the most similar confident neighbors in the same foundation feature space. The authors claim this narrows the one-label to full-supervision performance gap across modalities while reducing online foundation-model inference overhead.
Significance. If the central empirical claims are substantiated with rigorous validation, the work would advance annotation-efficient medical segmentation by demonstrating that offline foundation-model features can support extreme one-label SSL through prototype propagation and iterative refinement. The design choice to extract features once (avoiding repeated online inference) is a practical strength that addresses a common computational bottleneck. However, the absence of diagnostics isolating initial propagation quality and the reuse of the same feature space for refinement introduce risks that must be addressed before the significance can be fully assessed.
major comments (3)
- [§3.1] §3.1 (Prototype Propagation): The initial pseudo-label generation assumes cosine similarity in the foundation-model dense feature space provides a reliable proxy for semantic label agreement when propagating from a single template. No quantitative diagnostic (e.g., Dice scores or error maps against a held-out ground-truth subset) is reported for these coarse labels. This is load-bearing for the headline claim, because systematic mislabeling of entire anatomical structures due to inter-subject variability or pathology would produce correlated errors that the subsequent scratch-trained model is unlikely to correct.
- [§3.2] §3.2 (Uncertainty-Guided Refinement): The refinement step re-uses the identical foundation-model feature space both to generate the initial pseudo-labels and to identify “most similar confident neighbors” for correcting high-uncertainty voxels. This creates a potential circularity in which any initial feature-space biases are reinforced rather than independently validated. The manuscript provides no separate ablation or external anchor (e.g., comparison against an independently supervised reference) that isolates propagation accuracy from later refinement gains.
- [§4] §4 (Experiments): The abstract asserts “extensive experiments across modalities” yet the reported results omit ablation studies on the two free parameters (uncertainty threshold and number of refinement rounds), statistical significance tests, and direct comparison of initial versus final pseudo-label quality. Without these, it is impossible to determine whether the iterative loop genuinely narrows the supervision gap or merely stabilizes around the quality of the first-stage propagation.
minor comments (2)
- [Abstract] The abstract’s phrasing “significantly narrows the performance gap” should be accompanied by the actual quantitative deltas (Dice, HD95, etc.) relative to the one-label and full-supervision baselines.
- [§3.2] Notation for voxel-wise uncertainty (e.g., whether it is entropy, variance, or Monte-Carlo dropout) is introduced without an explicit equation; adding a short formula in §3.2 would improve reproducibility.
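As an illustration of the explicit formula the last comment asks for, two common definitions are written out below. They are examples only, since this review does not state which definition the manuscript uses. The symbols are assumptions: $p_c(v)$ is the softmax probability of class $c$ at voxel $v$, $C$ is the number of classes, and $T$ is the number of stochastic forward passes.

```latex
% Illustrative candidates only; the manuscript's actual definition is not given here.
% Predictive entropy of a single forward pass:
u_{\mathrm{ent}}(v) = -\sum_{c=1}^{C} p_c(v)\,\log p_c(v)

% Monte-Carlo dropout variant: mean prediction and per-voxel variance over T passes
\bar{p}_c(v) = \frac{1}{T}\sum_{t=1}^{T} p_c^{(t)}(v), \qquad
u_{\mathrm{var}}(v) = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{T}\sum_{t=1}^{T}
  \bigl(p_c^{(t)}(v) - \bar{p}_c(v)\bigr)^{2}
```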
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify key areas where additional evidence and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns.
- Referee: [§3.1] §3.1 (Prototype Propagation): The initial pseudo-label generation assumes cosine similarity in the foundation-model dense feature space provides a reliable proxy for semantic label agreement when propagating from a single template. No quantitative diagnostic (e.g., Dice scores or error maps against a held-out ground-truth subset) is reported for these coarse labels. This is load-bearing for the headline claim, because systematic mislabeling of entire anatomical structures due to inter-subject variability or pathology would produce correlated errors that the subsequent scratch-trained model is unlikely to correct.
Authors: We agree that quantitative diagnostics for the initial pseudo-labels are important to validate the prototype propagation step. While the final performance gains indicate that the initial labels provide a viable starting point, we acknowledge that direct evaluation against ground truth would better substantiate the claim. In the revised manuscript, we will add Dice scores computed on a held-out subset of volumes with ground-truth annotations, along with qualitative error maps, to characterize the quality and failure modes of the first-stage propagation.
Revision: yes
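The promised diagnostic reduces to a per-class Dice score between pseudo-labels and held-out ground truth. The sketch below assumes flattened label lists and treats a class that is absent from both inputs as a perfect match, which is one common convention and not one confirmed by the paper. The toy labels are illustrative, not results.

```python
def dice(pred, truth, cls):
    """Dice coefficient for one class between two flat label sequences."""
    pred_mask = [label == cls for label in pred]
    truth_mask = [label == cls for label in truth]
    intersection = sum(p and t for p, t in zip(pred_mask, truth_mask))
    denom = sum(pred_mask) + sum(truth_mask)
    # Convention (assumed): class absent from both inputs counts as Dice = 1.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

# Toy flattened volumes (illustrative values only).
pseudo_labels = [0, 1, 1, 1, 0, 0]
ground_truth = [0, 1, 1, 0, 0, 1]
score = dice(pseudo_labels, ground_truth, cls=1)
print(round(score, 4))  # → 0.6667
```

Computing this once on the Stage 1 pseudo-labels and again after each refinement round would give the initial-versus-final comparison that the referee requests.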
- Referee: [§3.2] §3.2 (Uncertainty-Guided Refinement): The refinement step re-uses the identical foundation-model feature space both to generate the initial pseudo-labels and to identify “most similar confident neighbors” for correcting high-uncertainty voxels. This creates a potential circularity in which any initial feature-space biases are reinforced rather than independently validated. The manuscript provides no separate ablation or external anchor (e.g., comparison against an independently supervised reference) that isolates propagation accuracy from later refinement gains.
Authors: We appreciate the concern about potential circularity. The uncertainty estimates originate from a segmentation model trained from scratch on the current pseudo-labels, supplying an independent learning signal that can correct initial errors before the feature-space aggregation is applied. Nevertheless, to isolate the refinement contribution, we will add an ablation in the revision that reports performance before and after each refinement round and compares the uncertainty-guided approach against a baseline that applies refinement without model-derived uncertainty. This will clarify the incremental gains attributable to the iterative loop.
Revision: yes
- Referee: [§4] §4 (Experiments): The abstract asserts “extensive experiments across modalities” yet the reported results omit ablation studies on the two free parameters (uncertainty threshold and number of refinement rounds), statistical significance tests, and direct comparison of initial versus final pseudo-label quality. Without these, it is impossible to determine whether the iterative loop genuinely narrows the supervision gap or merely stabilizes around the quality of the first-stage propagation.
Authors: We agree that the current experimental section would benefit from these additional analyses. In the revised manuscript we will include: (i) ablation tables varying the uncertainty threshold and the number of refinement rounds, (ii) statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) for all reported improvements over baselines, and (iii) a direct quantitative comparison of initial pseudo-label Dice scores versus final refined pseudo-label scores across datasets. These additions will demonstrate the iterative loop’s contribution beyond the first-stage propagation.
Revision: yes
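For small test sets, the Wilcoxon signed-rank test mentioned in (ii) can be computed exactly. The sketch below is a self-contained, standard-library version that enumerates the null distribution by dynamic programming; a real analysis would more likely call an established statistics package. The per-case Dice values are made-up toy numbers, not results from the paper.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # doubled ranks, so tie averages stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        doubled_avg = (i + 1) + (j + 1)  # 2 * mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks2[order[k]] = doubled_avg
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # counts[s] = number of sign assignments whose doubled W+ equals s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    w_min2 = min(w_plus2, total2 - w_plus2)
    tail = sum(counts[: w_min2 + 1])
    p_value = min(1.0, 2.0 * tail / 2 ** n)
    return w_plus2 / 2.0, p_value

# Toy per-case Dice scores (illustrative only).
method = [0.82, 0.79, 0.88, 0.91, 0.85, 0.77]
baseline = [0.78, 0.74, 0.80, 0.90, 0.79, 0.70]
w_plus, p = wilcoxon_signed_rank(method, baseline)
print(w_plus, p)  # → 21.0 0.03125
```

With six pairs, 0.03125 is the smallest two-sided p-value this test can produce, which is a reminder that significance claims on very small evaluation sets carry limited resolution.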
Circularity Check
No significant circularity; empirical framework is self-contained
The paper presents a two-stage empirical method: foundation-model feature extraction for prototype propagation to generate initial pseudo-labels, followed by iterative scratch training of a segmentation model with uncertainty estimation and neighbor aggregation for refinement. No equations, derivations, or claims reduce any result to its inputs by construction (e.g., no fitted parameter renamed as prediction, no self-definitional loop where output equals input, no uniqueness theorem imported via self-citation). Performance claims rest on experimental validation across tasks rather than forced equivalence. The described loop re-uses features for refinement but does not create a definitional or statistical circularity; it is an algorithmic choice whose efficacy is externally testable. This is the common honest outcome for method papers without mathematical reduction steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- uncertainty threshold for refinement
- number of refinement rounds
axioms (2)
- domain assumption: Feature similarity in the foundation-model embedding space implies label similarity for medical image voxels
- domain assumption: Voxel-wise uncertainty estimates from the segmentation model are reliable indicators of pseudo-label error