Do Instance Priors Help Weakly Supervised Semantic Segmentation?
Pith reviewed 2026-05-10 16:19 UTC · model grok-4.3
The pith
Adapting instance masks from a foundational model to weak labels improves semantic segmentation and cuts annotation costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SeSAM adapts the Segment Anything Model for semantic segmentation by decomposing class masks into connected components, sampling point prompts along object skeletons, selecting SAM masks using weak-label coverage, and iteratively refining labels using pseudo-labels. Integrated with a semi-supervised learning framework that balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, the method produces semantic segmentations that consistently outperform weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
What carries the argument
SeSAM pipeline that converts SAM instance masks into class-level semantic labels via connected-component decomposition, skeleton point sampling, weak-label coverage selection, and iterative pseudo-label refinement.
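The first two stages named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `connected_components` and `sample_prompt_points` are hypothetical names, masks are lists of 0/1 rows, and the "skeleton" is approximated by ridge pixels of a breadth-first distance-to-boundary map rather than a true skeletonization.

```python
from collections import deque

def _neighbours(y, x):
    # 4-connected neighbourhood
    return ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))

def connected_components(mask):
    """Split a binary class mask (rows of 0/1) into 4-connected components."""
    h, w = len(mask), len(mask[0])
    seen, components = set(), []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or (r, c) in seen:
                continue
            seen.add((r, c))
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in _neighbours(y, x):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(pixels)
    return components

def sample_prompt_points(pixels, k):
    """Pick up to k (row, col) prompts from ridge pixels of one component.

    Assumption: ridge pixels of a BFS distance map stand in for a skeleton.
    """
    if k < 1:
        return []
    inside = set(pixels)
    dist, queue = {}, deque()
    for p in pixels:  # boundary pixels touch something outside the component
        if any(n not in inside for n in _neighbours(*p)):
            dist[p] = 1
            queue.append(p)
    while queue:  # propagate distances inward
        p = queue.popleft()
        for n in _neighbours(*p):
            if n in inside and n not in dist:
                dist[n] = dist[p] + 1
                queue.append(n)
    ridge = [p for p in pixels
             if all(dist.get(n, 0) <= dist[p] for n in _neighbours(*p))]
    ridge.sort(key=lambda p: (-dist[p], p))  # deepest pixels first
    if len(ridge) <= k:
        return ridge
    step = len(ridge) / k
    return [ridge[int(i * step)] for i in range(k)]

demo_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
]
demo_components = connected_components(demo_mask)
demo_prompts = [sample_prompt_points(comp, 1) for comp in demo_components]
```

Each component gets its own prompts, so two disconnected regions of the same class are sent to SAM as separate objects. A real pipeline would likely use an image-processing library for both steps.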
If this is right
- SeSAM applies to multiple weak annotation types including coarse masks, scribbles, and points.
- Integration into semi-supervised training mixes ground-truth, SAM pseudo-labels, and high-confidence predictions to raise segmentation quality.
- The approach delivers consistent gains over weak baselines on several datasets.
- Annotation effort drops markedly compared with dense pixel-level supervision.
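One way to picture the mixing of the three label sources is a per-pixel priority rule. The sketch below is an assumption made for illustration: the paper says the sources are "balanced" but this page does not say how, so the fixed priority order, the `IGNORE` index of 255, and the `conf_threshold` default are all hypothetical.

```python
IGNORE = 255  # hypothetical "no supervision" index (a common convention)

def merge_labels(ground_truth, sam_pseudo, model_probs, conf_threshold=0.9):
    """Merge three supervision sources over a flat list of pixels.

    Assumed priority: ground truth, then SAM pseudo-label, then the model's
    own prediction if its top probability reaches conf_threshold.
    """
    merged = []
    for gt, sam, probs in zip(ground_truth, sam_pseudo, model_probs):
        if gt != IGNORE:
            merged.append(gt)
        elif sam != IGNORE:
            merged.append(sam)
        else:
            best = max(range(len(probs)), key=probs.__getitem__)
            merged.append(best if probs[best] >= conf_threshold else IGNORE)
    return merged

gt = [1, IGNORE, IGNORE, IGNORE]
sam = [0, 2, IGNORE, IGNORE]
probs = [[0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.05, 0.95, 0.0], [0.4, 0.35, 0.25]]
merged = merge_labels(gt, sam, probs)
```

Pixels that no source can label with confidence stay at `IGNORE`, so a training loss can skip them instead of learning from a guess.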
Where Pith is reading between the lines
- Foundational instance models can be repurposed for semantic tasks once the mismatch between instance and class outputs is bridged by targeted post-processing.
- The same adaptation pattern may apply to other dense-prediction problems where instance cues exist but class supervision remains sparse.
- Performance could improve further if selection criteria explicitly handle class overlap or small objects that current coverage rules might miss.
Load-bearing premise
The pipeline components can reliably turn SAM's instance masks into accurate class-level semantic labels without introducing systematic errors or biases.
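A simple way to probe this premise is to measure, per class, how often a pseudo-label disagrees with a trusted reference on pixels where both exist. The sketch below is illustrative only: `misassignment_rates` is a hypothetical helper, and the ignore index of 255 is an assumption.

```python
from collections import Counter

def misassignment_rates(pseudo, reference, ignore=255):
    """Per pseudo-label class: share of its pixels whose reference class differs."""
    assigned, wrong = Counter(), Counter()
    for p, r in zip(pseudo, reference):
        if p == ignore or r == ignore:
            continue  # only score pixels labelled in both maps
        assigned[p] += 1
        if p != r:
            wrong[p] += 1
    return {cls: wrong[cls] / assigned[cls] for cls in assigned}

rates = misassignment_rates(
    pseudo=[0, 0, 0, 1, 1, 255],
    reference=[0, 0, 1, 1, 1, 0],
)
```

A rate that is consistently high for one class, rather than spread evenly, would point to the kind of systematic bias the premise rules out.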
What would settle it
On a standard benchmark, if the final segmentation accuracy after using SeSAM-adapted labels is no higher than the accuracy obtained from the weak labels alone, the central claim would be falsified.
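The test reduces to comparing two mean IoU scores on the same benchmark. The following sketch computes mean IoU over flat label lists and applies the comparison. The helper names and the ignore index of 255 are assumptions, and classes absent from both prediction and target are skipped, which is one common convention among several.

```python
def mean_iou(pred, target, num_classes, ignore=255):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = union = 0
        for p, t in zip(pred, target):
            if t == ignore:
                continue  # unlabelled pixels do not count
            in_p, in_t = p == cls, t == cls
            inter += in_p and in_t
            union += in_p or in_t
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def central_claim_survives(miou_sesam, miou_weak_only):
    """The claim is falsified when SeSAM labels bring no gain over weak labels."""
    return miou_sesam > miou_weak_only

target = [0, 0, 1, 1]
miou_perfect = mean_iou([0, 0, 1, 1], target, num_classes=2)
miou_partial = mean_iou([0, 1, 1, 1], target, num_classes=2)
```

In practice both scores would come from the same segmentation network trained twice, once on the weak labels alone and once on the SeSAM-adapted labels.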
Original abstract
Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SeSAM, a framework that adapts the Segment Anything Model (SAM) to semantic segmentation from weak labels (coarse masks, scribbles, points). The approach decomposes class masks into connected components, samples skeleton points as prompts, selects SAM masks via weak-label coverage, and iteratively refines pseudo-labels within a semi-supervised training loop that balances ground-truth, SAM-based, and high-confidence pseudo-labels. The central claim is that using instance priors from SAM in this way consistently outperforms weakly supervised baselines across benchmarks while reducing annotation cost relative to full supervision.
Significance. If the pipeline reliably produces accurate class-level labels from SAM instances without systematic scale or texture biases, the work would be significant for showing how foundation-model instance priors can improve weakly supervised dense prediction. It offers a concrete, low-cost way to leverage SAM in semantic segmentation and integrates it with semi-supervised learning, which is a practical strength. The multi-benchmark, multi-annotation-type evaluation is also a positive aspect if the quantitative support is robust.
Major comments (2)
- [Method section] Method (pipeline description): the central claim that connected-component decomposition, skeleton point sampling, coverage-based selection, and iterative refinement together convert SAM instance masks into accurate semantic labels without systematic errors is load-bearing, yet no isolated ablations, error-propagation analysis (e.g., overlap handling or class misassignment rates), or direct comparison to a weak-label-to-semantic baseline that bypasses instance decomposition are provided. Without these, gains over baselines could be artifactual rather than evidence that instance priors help.
- [Experiments section] Experiments section: the abstract and summary assert consistent outperformance and cost reduction, but the absence of per-component ablation tables, quantitative error analysis, or a baseline that applies the semi-supervised framework directly to weak labels (without SAM instance conversion) makes it impossible to attribute improvements specifically to the instance-prior components.
Minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., mIoU deltas on a primary benchmark) to support the outperformance claim.
- [Method section] Notation for the coverage metric and the iterative refinement schedule should be defined more explicitly when first introduced to aid reproducibility.
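To show what an explicit definition could look like, here is one plausible form of a coverage score and selection rule. It is an assumption made for illustration, not the paper's definition: coverage is taken as the fraction of weak-label pixels that fall inside a candidate mask, `min_coverage` is a hypothetical threshold, and ties are broken in favour of the smaller mask.

```python
def coverage(candidate, weak):
    """Fraction of weak-label pixels covered by a candidate mask (sets of pixels)."""
    if not weak:
        return 0.0
    return len(candidate & weak) / len(weak)

def select_mask(candidates, weak, min_coverage=0.5):
    """Return the candidate with the highest coverage, or None if none qualifies.

    Ties go to the smaller mask so that a mask swallowing the whole image
    does not win by default.
    """
    best, best_key = None, None
    for cand in candidates:
        cov = coverage(cand, weak)
        if cov < min_coverage:
            continue
        key = (cov, -len(cand))
        if best_key is None or key > best_key:
            best, best_key = cand, key
    return best

weak = {(0, 0), (0, 1), (1, 0), (1, 1)}
half = {(0, 0), (0, 1)}
tight = weak | {(2, 2)}
loose = tight | {(3, 3)}
chosen = select_mask([half, loose, tight], weak)
```

Writing the score down in this form also makes the refinement schedule easier to state, since each round can be described by the threshold it uses.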
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight the importance of isolating the contributions of our instance-prior components and providing a direct baseline comparison. We agree that these elements strengthen the attribution of gains to SAM's instance priors. We have revised the manuscript to address both points.
Point-by-point responses
- Referee: [Method section] Method (pipeline description): the central claim that connected-component decomposition, skeleton point sampling, coverage-based selection, and iterative refinement together convert SAM instance masks into accurate semantic labels without systematic errors is load-bearing, yet no isolated ablations, error-propagation analysis (e.g., overlap handling or class misassignment rates), or direct comparison to a weak-label-to-semantic baseline that bypasses instance decomposition are provided. Without these, gains over baselines could be artifactual rather than evidence that instance priors help.
Authors: We agree that the original submission lacked isolated ablations and a direct baseline comparison that applies the semi-supervised framework to weak labels without SAM instance conversion. These analyses are necessary to rigorously attribute improvements to the instance priors. In the revised manuscript, we have added a dedicated ablation study in Section 4.3 that evaluates each component (connected-component decomposition, skeleton-based sampling, coverage-based selection, and iterative refinement) in isolation, along with a quantitative error analysis covering overlap handling and class misassignment rates. We have also introduced a new baseline that feeds weak labels directly into the semi-supervised loop without SAM instance decomposition. Results on the benchmarks show that the full pipeline outperforms this baseline, supporting the value of the instance priors. These additions are now included in the Experiments section.
revision: yes
- Referee: [Experiments section] Experiments section: the abstract and summary assert consistent outperformance and cost reduction, but the absence of per-component ablation tables, quantitative error analysis, or a baseline that applies the semi-supervised framework directly to weak labels (without SAM instance conversion) makes it impossible to attribute improvements specifically to the instance-prior components.
Authors: The referee correctly notes the absence of these elements in the original submission. To enable clear attribution, the revised manuscript now contains per-component ablation tables (Table 4 and supplementary material) and quantitative error-propagation metrics. We have also added the requested baseline that applies the semi-supervised training loop directly to the weak labels, bypassing SAM instance conversion. Comparative results demonstrate that incorporating the SAM-based instance priors yields measurable gains over this baseline across annotation types, while maintaining the reported annotation cost reductions. These updates appear in the revised Experiments section and support the central claim.
revision: yes
Circularity Check
No circularity: purely empirical pipeline with no derivations or fitted predictions
The paper presents SeSAM as an empirical framework that adapts SAM via connected-component decomposition, skeleton sampling, coverage-based selection, and iterative pseudo-label refinement, then evaluates it through experiments on benchmarks. No equations, first-principles derivations, parameter fittings, or predictions appear in the abstract or described content. Claims rest on comparative performance results rather than any self-referential reduction of outputs to inputs. The contribution is self-contained as a practical method description and validation, with no load-bearing self-citations or ansatzes that collapse the central result.
Reference graph
Works this paper leans on
-
[1]
Accessed: 2024-11-14, Available at: https://github.com/facebookresearch/segment-anything/issues/169. Jiwoon Ahn and Suha Kwak. Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4981–4990,
-
[2]
SAM 3: Segment Anything with Concepts
doi: 10.52202/079017-1463. URL https://doi.org/10.52202/079017-1463. Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment Anything with Concepts. arXiv preprint arXiv:2511.16719,
-
[3]
arXiv preprint arXiv:2304.08506 (2023)
URL https://openreview.net/forum?id=HJz6tiCqYm. Chuanfei Hu, Tianyi Xia, Shenghong Ju, and Xinde Li. When Sam Meets Medical Images: An Investigation of Segment Anything Model (Sam) on Multi-phase Liver Tumor Segmentation. arXiv preprint arXiv:2304.08506,
-
[4]
doi: 10.1007/978-3-031-73195-2_27. URL https://doi.org/10.1007/978-3-031-73195-2_27. Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, and Fisher Yu. Matching Anything by Segmenting Anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18963–18973,
-
[5]
Peilun Shi, Jianing Qiu, Sai Mu Dalike Abaxi, Hao Wei, Frank P-W Lo, and Wu Yuan. Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-shot Medical Segmentation. Diagnostics, 13(11):1947,
-
[6]
Possam: Panoptic Open-vocabulary Segment Anything. arXiv preprint arXiv:2403.09620,
Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, and Fatih Porikli. Possam: Panoptic Open-vocabulary Segment Anything. arXiv preprint arXiv:2403.09620,
-
[7]
arXiv preprint arXiv:2304.11968 (2023)
Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track Anything: Segment Anything Meets Videos. arXiv preprint arXiv:2304.11968,
-
[8]
As shown in Fig. 7, (τ1, τ2) = (0.3, 0.7) provides the better trade-off for mask refinement.
Figure 8: Why points for prompting? (Panels: point prompting, bounding-box prompting, mask prompting.) Point prompts generate the best segmentation masks using SAM (left). Whereas, bounding-box prompting forces SAM to generate labels within the box and does not completely label the obj...
-
[9]
we show robustness of our hyperparameters (Sec. 3.2) across datasets: Cityscapes and ADE20k. In particular, we ablate number of points sampled from coarse mask as well as mask selection hyperparameter (τ1, τ2). We observe that sampling 5 points works
Table 9: Inconsistency between coarse and fine annotations. We perform an experiment to understand the label misma...
-
[10]
and train it on Cityscapes dataset. We observed that the model is unable to localize class ‘road’ in any of the images, and wrongly predicts it as bus (see Fig. 10). SAM2CAM achieves a mean IoU of 5.8% on Cityscapes dataset compared to 61.7% by our SeSAM using point labels with the same DeepLabv3+ segmentation network. Dependence on SAM Quality. We conduct...