
arxiv: 2604.24109 · v1 · submitted 2026-04-27 · 💻 cs.CV

SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation?

Pith reviewed 2026-05-08 04:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords semi-supervised learning · medical image segmentation · one-label annotation · foundation models · pseudo-labeling · uncertainty refinement · annotation efficiency

The pith

SemiSAM-O1 shows that medical image segmentation with only one annotated template can approach full supervision performance by using foundation model features for label propagation and iterative refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that annotation-efficient segmentation remains feasible even in the extreme one-label setting for medical images, provided foundation model representations are fully exploited. It first creates coarse pseudo-labels by propagating class prototypes from the single annotated template according to feature similarity. It then alternates between training a segmentation model from scratch and refining high-uncertainty regions using similar, confident neighbors in the feature space. This matters because expert annotation is costly and medical imaging datasets therefore tend to have very few labels, so closing the gap to full supervision with minimal labeled data could broaden the use of deep learning in healthcare. If the result holds, it reduces both the annotation burden and the computational cost of repeated foundation model calls.

Core claim

The central claim is that SemiSAM-O1 significantly narrows the performance gap to full supervision while lowering online inference overhead across diverse modalities and targets. It extracts dense features from all volumes with a foundation model's encoder, derives class prototypes from the one annotated template, and propagates them by feature similarity to generate initial pseudo-labels. It then enters an iterative loop in which the model is trained from scratch on the current pseudo-labels and its predictions are refined by uncertainty-guided aggregation in the foundation model's feature space.

What carries the argument

A two-stage framework: feature-similarity prototype propagation produces the initial pseudo-labels, and an iterative training-and-refinement loop then uses voxel-wise uncertainty to correct them via neighbor aggregation in the foundation model's feature space.
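As an illustration of the first stage, here is a minimal sketch of prototype propagation, assuming cosine similarity (the measure the referee summary attributes to the paper) and per-class averaging of the template's voxel features. The function names, array shapes, and toy values are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_prototypes(template_feats, template_labels, num_classes):
    """Average the template's voxel features per class.

    template_feats:  (N, D) array, one D-dim feature per template voxel
    template_labels: (N,) integer class ids; every class is assumed present
    """
    protos = np.stack([
        template_feats[template_labels == c].mean(axis=0)
        for c in range(num_classes)
    ])
    return l2_normalize(protos)

def propagate(unlabeled_feats, protos):
    """Give each unlabeled voxel the class of its most similar prototype."""
    sims = l2_normalize(unlabeled_feats) @ protos.T  # (M, C) cosine similarities
    return sims.argmax(axis=1), sims.max(axis=1)

# Toy 2-D features standing in for encoder outputs (made-up values).
template_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
template_labels = np.array([0, 0, 1, 1])
protos = class_prototypes(template_feats, template_labels, num_classes=2)

unlabeled_feats = np.array([[2.0, 0.2], [0.3, 1.5]])
pseudo_labels, confidence = propagate(unlabeled_feats, protos)
```

The returned similarity doubles as a rough confidence score, which is one way a coarse label could be flagged for later correction.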

If this is right

  • Segmentation models trained this way achieve Dice scores much closer to fully supervised ones than previous one-label methods.
  • Computational costs drop because the foundation model is not queried online during the entire training process.
  • The method works robustly on complex modalities where earlier foundation model-driven SSL approaches underperformed.
  • Multiple refinement rounds create a cycle where better models produce better pseudo-labels and vice versa.
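The cycle in the last point can be sketched as a plain control loop. This is only a skeleton under stated assumptions: `train_fn`, `predict_fn`, and `refine_fn` are hypothetical placeholders for scratch training, prediction with uncertainty, and pseudo-label refinement, and the default of three rounds is illustrative rather than the paper's setting.

```python
def iterative_refinement(initial_pseudo_labels, train_fn, predict_fn,
                         refine_fn, num_rounds=3):
    """Alternate scratch training and pseudo-label refinement.

    train_fn(pseudo_labels) -> model                 (a fresh model each round)
    predict_fn(model) -> (predictions, uncertainty)
    refine_fn(predictions, uncertainty) -> refined pseudo-labels
    All three callables are hypothetical placeholders.
    """
    pseudo_labels = initial_pseudo_labels
    model = None
    for _ in range(num_rounds):
        # Each round discards the previous model and retrains from scratch.
        model = train_fn(pseudo_labels)
        predictions, uncertainty = predict_fn(model)
        # Refined labels become the training targets of the next round.
        pseudo_labels = refine_fn(predictions, uncertainty)
    return model, pseudo_labels
```

Note that the foundation model does not appear inside the loop; in this reading its features are computed once beforehand and only consulted by the refinement step.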

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reliance on dense features implies that foundation models trained on general data can transfer semantic understanding effectively to specialized medical domains without task-specific fine-tuning.
  • Extending the uncertainty-guided correction to other semi-supervised setups could improve label quality in low-data regimes beyond medical imaging.
  • Testing the method on datasets with domain shifts, such as different scanners or patient populations, would show if the feature similarity holds across variations.

Load-bearing premise

The dense features extracted by the foundation model must contain enough semantic information to accurately propagate labels from one template via similarity, and the uncertainty estimates from the trained model must correctly highlight areas for reliable correction by neighbor labels.
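To make the second half of this premise concrete, here is a minimal sketch of uncertainty-guided correction, assuming a fixed uncertainty threshold, cosine similarity, and a majority vote over the k nearest confident neighbors. The threshold, k, the voting rule, and the granularity of the features (per item rather than per volume) are all assumptions for illustration; the paper's exact aggregation rule is not given in this review.

```python
import numpy as np

def refine_uncertain(labels, uncertainty, feats, num_classes,
                     threshold=0.5, k=3):
    """Relabel high-uncertainty items by majority vote of confident neighbors.

    labels:      (N,) predicted class ids
    uncertainty: (N,) scores, larger means less certain
    feats:       (N, D) foundation-model features for the same items
    threshold, k: illustrative values, not taken from the paper
    """
    labels = np.asarray(labels)
    uncertain = np.asarray(uncertainty) > threshold
    confident_idx = np.flatnonzero(~uncertain)
    if confident_idx.size == 0 or not uncertain.any():
        return labels.copy()

    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sims = unit[uncertain] @ unit[confident_idx].T      # (U, C) cosine similarities
    k = min(k, confident_idx.size)
    nearest = np.argsort(-sims, axis=1)[:, :k]          # top-k confident neighbors
    neighbor_labels = labels[confident_idx][nearest]    # (U, k)

    # Count votes per class and keep the most frequent neighbor label.
    votes = np.stack([(neighbor_labels == c).sum(axis=1)
                      for c in range(num_classes)], axis=1)
    refined = labels.copy()
    refined[uncertain] = votes.argmax(axis=1)
    return refined

# Toy data (made-up): the last item is mislabeled and flagged as uncertain.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.95, 0.05]])
labels = np.array([0, 0, 1, 1, 1])
uncertainty = np.array([0.1, 0.1, 0.1, 0.1, 0.9])
refined = refine_uncertain(labels, uncertainty, feats, num_classes=2)
```

The sketch also shows where the premise can fail: a wrong but confident label is never revisited, and it can outvote correct neighbors when it sits close in feature space.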

What would settle it

If experiments on additional complex modalities show that after several refinement iterations the segmentation accuracy stays far below fully supervised levels, or if uncertainty guidance introduces more errors than it corrects, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.24109 by Bichun Xu, Judong Luo, Le Xue, Yichi Zhang, Yuan Cheng, Yuan Qi, Yu Fu, Zhigang Wu, Zixin Hu.

Figure 1: Comparison of SemiSAM+ and the proposed SemiSAM-O1. SemiSAM+ repeatedly invokes SAM inference during training to compute a consistency loss between the specialist model and SAM outputs, incurring substantial computational overhead with no quality control on SAM’s output reliability. In contrast, SemiSAM-O1 employs SAM only once in an offline stage to learn representations for prototype-based pseudo-label i…
Figure 2: Overview of the proposed SemiSAM-O1 framework. (a) The generalist foundation model encoder extracts spatial and global features from all…
Figure 3: Training time breakdown of SemiSAM+ and SemiSAM-O1 (R1-R3) on the LA dataset. Blue and orange segments indicate the time spent on specialist model training and generalist model inference, respectively.
… unlabeled). Input patches of size 128 × 128 × 128 are randomly cropped, with random flipping and rotation for data augmentation. Each iterative round consists of 15,000 training iterations. The pseudo-labe…
Figure 4: Effect of the number of iterative refinement rounds of SemiSAM-O1 on segmentation performance (Dice) using the MT and UA-MT backbones on the LA dataset.
Figure 5: Pseudo-label quality measured by Dice against ground truth anno…
Figure 6: Qualitative visualization of segmentation predictions from R1 to R8 on representative test cases. The green area, red area, and blue area represent…
Figure 7: Ablation study on the pseudo-label refinement step. Removing…
Figure 8: Qualitative comparison of segmentation results with and without…
Figure 9: Qualitative comparison of segmentation results across di…
Original abstract

Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model's feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model's encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model's global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SemiSAM-O1, an annotation-efficient SSL framework for medical image segmentation that requires only a single annotated template. Stage 1 extracts dense features once from a foundation model encoder and propagates class prototypes via cosine similarity to generate initial pseudo-labels for the unlabeled pool. Stage 2 runs an iterative loop that trains a segmentation model from scratch on current pseudo-labels, computes voxel-wise uncertainty, and refines high-uncertainty regions by aggregating labels from the most similar confident neighbors in the same foundation feature space. The authors claim this narrows the one-label to full-supervision performance gap across modalities while reducing online foundation-model inference overhead.

Significance. If the central empirical claims are substantiated with rigorous validation, the work would advance annotation-efficient medical segmentation by demonstrating that offline foundation-model features can support extreme one-label SSL through prototype propagation and iterative refinement. The design choice to extract features once (avoiding repeated online inference) is a practical strength that addresses a common computational bottleneck. However, the absence of diagnostics isolating initial propagation quality and the reuse of the same feature space for refinement introduce risks that must be addressed before the significance can be fully assessed.

major comments (3)
  1. [§3.1] §3.1 (Prototype Propagation): The initial pseudo-label generation assumes cosine similarity in the foundation-model dense feature space provides a reliable proxy for semantic label agreement when propagating from a single template. No quantitative diagnostic (e.g., Dice scores or error maps against a held-out ground-truth subset) is reported for these coarse labels. This is load-bearing for the headline claim, because systematic mislabeling of entire anatomical structures due to inter-subject variability or pathology would produce correlated errors that the subsequent scratch-trained model is unlikely to correct.
  2. [§3.2] §3.2 (Uncertainty-Guided Refinement): The refinement step re-uses the identical foundation-model feature space both to generate the initial pseudo-labels and to identify “most similar confident neighbors” for correcting high-uncertainty voxels. This creates a potential circularity in which any initial feature-space biases are reinforced rather than independently validated. The manuscript provides no separate ablation or external anchor (e.g., comparison against an independently supervised reference) that isolates propagation accuracy from later refinement gains.
  3. [§4] §4 (Experiments): The abstract asserts “extensive experiments across modalities” yet the reported results omit ablation studies on the two free parameters (uncertainty threshold and number of refinement rounds), statistical significance tests, and direct comparison of initial versus final pseudo-label quality. Without these, it is impossible to determine whether the iterative loop genuinely narrows the supervision gap or merely stabilizes around the quality of the first-stage propagation.
minor comments (2)
  1. [Abstract] The abstract’s phrasing “significantly narrows the performance gap” should be accompanied by the actual quantitative deltas (Dice, HD95, etc.) relative to the one-label and full-supervision baselines.
  2. [§3.2] Notation for voxel-wise uncertainty (e.g., whether it is entropy, variance, or Monte-Carlo dropout) is introduced without an explicit equation; adding a short formula in §3.2 would improve reproducibility.
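For reference, one candidate definition of the kind of formula requested in the second minor comment is predictive entropy over stochastic forward passes (Monte-Carlo dropout), a common choice in uncertainty-aware mean-teacher methods. This is an assumption for illustration; the review does not establish which estimator the paper uses, and the symbols T (number of passes), C (number of classes), v (voxel), and τ (threshold) are introduced here.

```latex
\bar{p}_c(v) = \frac{1}{T}\sum_{t=1}^{T} p_c^{(t)}(v),
\qquad
u(v) = -\sum_{c=1}^{C} \bar{p}_c(v)\,\log \bar{p}_c(v),
\qquad
\text{refine } v \iff u(v) > \tau
```

Here p_c^{(t)}(v) is the softmax probability of class c at voxel v in the t-th pass. With T = 1 this reduces to the plain softmax entropy, so the same threshold rule covers both variants.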

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which identify key areas where additional evidence and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns.

Point-by-point responses
  1. Referee: [§3.1] §3.1 (Prototype Propagation): The initial pseudo-label generation assumes cosine similarity in the foundation-model dense feature space provides a reliable proxy for semantic label agreement when propagating from a single template. No quantitative diagnostic (e.g., Dice scores or error maps against a held-out ground-truth subset) is reported for these coarse labels. This is load-bearing for the headline claim, because systematic mislabeling of entire anatomical structures due to inter-subject variability or pathology would produce correlated errors that the subsequent scratch-trained model is unlikely to correct.

    Authors: We agree that quantitative diagnostics for the initial pseudo-labels are important to validate the prototype propagation step. While the final performance gains indicate that the initial labels provide a viable starting point, we acknowledge that direct evaluation against ground truth would better substantiate the claim. In the revised manuscript, we will add Dice scores computed on a held-out subset of volumes with ground-truth annotations, along with qualitative error maps, to characterize the quality and failure modes of the first-stage propagation. revision: yes

  2. Referee: [§3.2] §3.2 (Uncertainty-Guided Refinement): The refinement step re-uses the identical foundation-model feature space both to generate the initial pseudo-labels and to identify “most similar confident neighbors” for correcting high-uncertainty voxels. This creates a potential circularity in which any initial feature-space biases are reinforced rather than independently validated. The manuscript provides no separate ablation or external anchor (e.g., comparison against an independently supervised reference) that isolates propagation accuracy from later refinement gains.

    Authors: We appreciate the concern about potential circularity. The uncertainty estimates originate from a segmentation model trained from scratch on the current pseudo-labels, supplying an independent learning signal that can correct initial errors before the feature-space aggregation is applied. Nevertheless, to isolate the refinement contribution, we will add an ablation in the revision that reports performance before and after each refinement round and compares the uncertainty-guided approach against a baseline that applies refinement without model-derived uncertainty. This will clarify the incremental gains attributable to the iterative loop. revision: yes

  3. Referee: [§4] §4 (Experiments): The abstract asserts “extensive experiments across modalities” yet the reported results omit ablation studies on the two free parameters (uncertainty threshold and number of refinement rounds), statistical significance tests, and direct comparison of initial versus final pseudo-label quality. Without these, it is impossible to determine whether the iterative loop genuinely narrows the supervision gap or merely stabilizes around the quality of the first-stage propagation.

    Authors: We agree that the current experimental section would benefit from these additional analyses. In the revised manuscript we will include: (i) ablation tables varying the uncertainty threshold and the number of refinement rounds, (ii) statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) for all reported improvements over baselines, and (iii) a direct quantitative comparison of initial pseudo-label Dice scores versus final refined pseudo-label scores across datasets. These additions will demonstrate the iterative loop’s contribution beyond the first-stage propagation. revision: yes
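The promised comparison of initial and final pseudo-label quality could be computed along the following lines. This is a standard-library sketch with made-up numbers: `dice` and `paired_t_statistic` are hypothetical helper names, and a p-value would additionally require the Student t distribution with n - 1 degrees of freedom (SciPy's `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` cover the two tests the authors name).

```python
import math
import statistics

def dice(pred, target):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

def paired_t_statistic(before, after):
    """t statistic for the mean per-case difference (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / std_err

# Made-up per-case Dice scores for initial and refined pseudo-labels.
initial_dice = [0.70, 0.72, 0.68]
final_dice = [0.80, 0.85, 0.75]
t_stat = paired_t_statistic(initial_dice, final_dice)

# Made-up masks: one of the two predicted voxels overlaps the target.
example_dice = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

Pairing by case matters here because the same volumes are scored before and after refinement, so an unpaired test would understate the evidence.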

Circularity Check

0 steps flagged

No significant circularity; empirical framework is self-contained

Full rationale

The paper presents a two-stage empirical method: foundation-model feature extraction for prototype propagation to generate initial pseudo-labels, followed by iterative scratch training of a segmentation model with uncertainty estimation and neighbor aggregation for refinement. No equations, derivations, or claims reduce any result to its inputs by construction (e.g., no fitted parameter renamed as prediction, no self-definitional loop where output equals input, no uniqueness theorem imported via self-citation). Performance claims rest on experimental validation across tasks rather than forced equivalence. The described loop re-uses features for refinement but does not create a definitional or statistical circularity; it is an algorithmic choice whose efficacy is externally testable. This is the common honest outcome for method papers without mathematical reduction steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the unstated premise that foundation-model features are semantically meaningful enough for nearest-neighbor label transfer and that iterative self-training converges without external supervision. No explicit free parameters or invented entities are named in the abstract, but implicit thresholds for uncertainty and similarity are required.

free parameters (2)
  • uncertainty threshold for refinement
    Used to decide which voxels get corrected by neighbor aggregation; value must be chosen or tuned.
  • number of refinement rounds
    Controls how many times the train-and-correct loop is executed.
axioms (2)
  • domain assumption: Feature similarity in the foundation-model embedding space implies label similarity for medical image voxels
    Invoked in both stage 1 prototype propagation and stage 2 uncertainty-guided correction.
  • domain assumption: Voxel-wise uncertainty estimates from the segmentation model are reliable indicators of pseudo-label error
    Central to deciding which regions to overwrite with neighbor labels.

pith-pipeline@v0.9.0 · 5599 in / 1430 out tokens · 29156 ms · 2026-05-08T04:36:41.403352+00:00 · methodology

