
arxiv: 2604.11170 · v1 · submitted 2026-04-13 · 💻 cs.CV

Do Instance Priors Help Weakly Supervised Semantic Segmentation?

Pith reviewed 2026-05-10 16:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords weakly supervised semantic segmentation · Segment Anything Model · instance segmentation · pseudo-label refinement · semi-supervised learning · annotation efficiency · weak labels · connected components

The pith

Adapting instance masks from a foundational model to weak labels improves semantic segmentation and cuts annotation costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether instance-level priors can strengthen semantic segmentation when only weak labels such as points, scribbles or coarse masks are available. It presents SeSAM, a pipeline that turns SAM's instance outputs into class-level labels by breaking masks into connected components, sampling points along object skeletons, choosing masks by coverage of the weak signal, and refining them iteratively with pseudo-labels. These adapted labels are then fed into a semi-supervised trainer that mixes ground-truth, SAM-derived, and high-confidence predictions. Experiments across benchmarks and annotation types show higher accuracy than standard weak-supervision baselines while using far less labeling effort than full pixel supervision.

Core claim

SeSAM adapts the Segment Anything Model for semantic segmentation by decomposing class masks into connected components, sampling point prompts along object skeletons, selecting SAM masks using weak-label coverage, and iteratively refining labels using pseudo-labels. Integrated with a semi-supervised learning framework that balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, the method produces semantic segmentations that consistently outperform weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
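The first step named above, decomposing a class mask into connected components, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name `connected_components`, the 4-connectivity choice, and the toy `person_mask` are assumptions made for this example.

```python
from collections import deque

def connected_components(mask):
    """Split a binary class mask (list of lists of 0/1) into 4-connected components.

    Returns a list of components, each a sorted list of (row, col) pixels.
    The 4-connectivity rule is an assumption of this sketch.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(sorted(pixels))
    return components

# Toy "person" class mask containing two separate blobs (hypothetical data).
person_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
parts = connected_components(person_mask)
print(len(parts))
```

Each component can then be prompted separately, which matches the motivation given in the Figure 2 caption: SAM handles one instance at a time better than a whole multi-instance class mask.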

What carries the argument

SeSAM pipeline that converts SAM instance masks into class-level semantic labels via connected-component decomposition, skeleton point sampling, weak-label coverage selection, and iterative pseudo-label refinement.
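The skeleton point sampling step can be illustrated as below. This sketch assumes the skeleton pixels of a component have already been computed elsewhere (for example with a thinning routine such as scikit-image's `skeletonize`); the helper name `sample_point_prompts`, the seed argument, and the toy skeleton are hypothetical. The default of five points follows the Figure 6 caption, which reports N = 5 as the best trade-off.

```python
import random

def sample_point_prompts(skeleton_pixels, num_points=5, seed=0):
    """Draw up to `num_points` distinct prompts uniformly at random from skeleton pixels.

    `skeleton_pixels` is a list of (row, col) tuples; computing the skeleton
    itself is outside the scope of this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    pixels = list(skeleton_pixels)
    k = min(num_points, len(pixels))
    return rng.sample(pixels, k)

# Hypothetical skeleton of one connected component: a horizontal centre line.
skeleton = [(10, x) for x in range(20)]
prompts = sample_point_prompts(skeleton, num_points=5, seed=42)
print(prompts)
```

Sampling along the skeleton keeps prompts away from object boundaries, which is one plausible reason it would cover a mask better than sampling only the most confident pixels.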

If this is right

  • SeSAM applies to multiple weak annotation types including coarse masks, scribbles, and points.
  • Integration into semi-supervised training mixes ground-truth, SAM pseudo-labels, and high-confidence predictions to raise segmentation quality.
  • The approach delivers consistent gains over weak baselines on several datasets.
  • Annotation effort drops markedly compared with dense pixel-level supervision.
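One simple way to picture the mixing of ground-truth labels, SAM pseudo-labels, and high-confidence predictions is a per-pixel priority merge. The summary does not spell out the paper's actual balancing rule, so everything here is an assumption for illustration: the priority order, the `IGNORE` index of 255, the 0.9 confidence threshold, and the function name `merge_labels`.

```python
IGNORE = 255  # conventional "no label" index; an assumption of this sketch

def merge_labels(ground_truth, sam_labels, predictions, confidences, threshold=0.9):
    """Merge three label sources per pixel (all inputs are flat, equal-length lists).

    Priority (assumed): weak ground truth, then SAM pseudo-label, then the
    model's own prediction if its confidence reaches `threshold`.
    """
    merged = []
    for gt, sam, pred, conf in zip(ground_truth, sam_labels, predictions, confidences):
        if gt != IGNORE:
            merged.append(gt)
        elif sam != IGNORE:
            merged.append(sam)
        elif conf >= threshold:
            merged.append(pred)
        else:
            merged.append(IGNORE)  # leave the pixel unsupervised
    return merged

# Four toy pixels with hypothetical class indices and confidences.
targets = merge_labels(
    ground_truth=[1, IGNORE, IGNORE, IGNORE],
    sam_labels=[2, 2, IGNORE, IGNORE],
    predictions=[3, 3, 3, 3],
    confidences=[0.99, 0.99, 0.95, 0.50],
)
print(targets)
```

A real trainer might instead weight separate loss terms for each source; the hard priority used here is only the simplest variant to read.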

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Foundational instance models can be repurposed for semantic tasks once the mismatch between instance and class outputs is bridged by targeted post-processing.
  • The same adaptation pattern may apply to other dense-prediction problems where instance cues exist but class supervision remains sparse.
  • Performance could improve further if selection criteria explicitly handle class overlap or small objects that current coverage rules might miss.

Load-bearing premise

The pipeline components can reliably turn SAM's instance masks into accurate class-level semantic labels without introducing systematic errors or biases.

What would settle it

On a standard benchmark, if the final segmentation accuracy after using SeSAM-adapted labels is no higher than the accuracy obtained from the weak labels alone, the central claim would be falsified.
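The falsification test above comes down to comparing a segmentation score between two training runs. A minimal sketch using mean intersection-over-union (mIoU), the metric quoted elsewhere on this page, is shown below. The toy predictions, the `ignore_index` of 255, and the function name `mean_iou` are assumptions for illustration, not values from the paper.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the target.

    `pred` and `target` are flat, equal-length lists of class indices.
    """
    ious = []
    for cls in range(num_classes):
        inter = union = 0
        for p, t in zip(pred, target):
            if t == ignore_index:
                continue  # unlabeled pixels do not count
            p_is, t_is = p == cls, t == cls
            inter += p_is and t_is
            union += p_is or t_is
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Hypothetical four-pixel example with two classes.
target = [0, 0, 1, 1]
weak_only_pred = [0, 1, 1, 1]   # model trained on weak labels alone
adapted_pred = [0, 0, 1, 1]     # model trained on SAM-adapted labels

gain = mean_iou(adapted_pred, target, 2) - mean_iou(weak_only_pred, target, 2)
print(gain > 0)
```

If `gain` were zero or negative on a real benchmark, the central claim would fail the test described above.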

Figures

Figures reproduced from arXiv: 2604.11170 by Anna Kukleva, Anurag Das, Bernt Schiele, Xinting Hu, Yuki M. Asano.

Figure 1
Figure 1: Our SeSAM framework, utilizing additional instance priors, integrates with various weak labels (points, scribbles, and coarse annotations) and enhances semantic segmentation quality (see road and car on the left) while remaining cost-effective (right plot). For instance, using scribbles achieves 94% of fine-supervised performance while requiring only 2% of the annotation budget.
Figure 2
Figure 2: SAM challenges. Left: As an instance-based model, SAM performs poorly when segmenting a class comprising multiple instances (see the class “person”). Splitting the class mask into individual instances addresses this challenge. Middle: Confidence-based sampling (Kweon & Yoon, 2024) yields a suboptimal mask; in contrast, our proposed sampling strategy achieves comprehensive mask coverage, resulting in improv…
Figure 3
Figure 3: SeSAM framework for coarse supervision. First, pseudo-labels for training images are generated by the teacher model. Next, these pseudo-labels are refined following the key steps 1-3, namely instance separation, point sampling, and mask selection (see Section 3). To improve mask quality, we augment coarse labels with high-confidence pseudo-labels before applying SAM (step 0). The resulting refined labe…
Figure 4
Figure 4: Annotation cost vs. performance plot. We compare the performance at different budgets for each of the weak labels (point, scribble, and coarse), and compare the cost-effectiveness of our method with fine annotation (bottom-left). In particular, scribble supervision reaches performance comparable to fine-label supervision using only 6% of the annotation budget. In a low-budget setting (109 hrs), by contrast, it …
Figure 5
Figure 5: Qualitative performance comparison. We conduct a qualitative performance comparison of our framework with supervised models using various weak labels, including point, scribble, and coarse annotations. Overall, our method demonstrates superior class segmentation, particularly with enhanced prediction along boundaries (see car). We make two comparisons. First, we compare the performances of different weak l…
Figure 6
Figure 6: Ablation with number of point prompts. We compute precision and recall for SAM masks for different numbers of point prompts. We observe that N = 5 yields the best trade-off. [Plot: Cityscapes recall (%) and Cityscapes precision (%) on two y-axes; x-axis ticks (1,9), (3,7), (5,5), (7,3).]
Figure 8
Figure 8: Why points for prompting? Point prompts generate the best segmentation masks using SAM (left), whereas bounding-box prompting forces SAM to generate labels within the box and does not completely label the object (middle). The mask prompt fails to generate a meaningful mask (right), as also discussed in (sam, 2024). A.2 Additional Details. Why points as prompts? In our experiments, we choose points as the promptin…
Figure 9
Figure 9: Inconsistency between coarse and fine labels. In this example, we see the class wall is incorrectly labeled as fence in the coarse annotation. Similarly, vegetation is wrongly labeled as road in the coarse annotation. Why are scribbles better? In our experimental evaluation (Tab. 1), we observe scribbles outperforming coarse annotation, particularly for the Cityscapes dataset. We have the following reasons supporting t…
Figure 10
Figure 10: Image label as weak supervision. The best-performing image-label WSSS method, SAM2CAM (Kweon & Yoon, 2024), fails to predict correct segmentation labels for the Cityscapes dataset (class ‘road’ is wrongly predicted as ‘bus’). In such real-world datasets, classes co-occur quite often (e.g. car and road in all images), making it difficult to obtain a correct CAM for class localization. … where image is barely visible, th…
Figure 11
Figure 11: Top: Contrast corruption on a Cityscapes image for different severity levels. Bottom: Performance comparison for Contrast and Fog corruptions across severity levels. SAM remains robust with minimal F1 drop up to severity 3. A.3 Additional Qualitative Results. Sampling of points. We propose uniform probabilistic sampling from the topological skeleton of the coarse masks as discussed in Sec. 3. As shown in …
Figure 12
Figure 12: Skeleton-based uniform sampling. We propose uniform probability-based sampling from the topological skeleton of the mask. Panels: Image, Coarse mask, SAM mask 1, SAM mask 2, SAM mask 3.
Figure 14
Figure 14: Qualitative performance comparison on Cityscapes. We conduct a qualitative performance comparison of our framework with supervised models using various weak labels, including point, scribble, and coarse supervision. Overall, our method demonstrates superior class segmentation, particularly with enhanced prediction along boundaries (see car in rows 1, 3, and 4; wall and bicycle in row 2).
Figure 15
Figure 15: Qualitative performance comparison on ADE20k. We conduct a qualitative performance comparison of our framework with supervised models using various weak labels, including point, scribble, and coarse supervision. Overall, our method demonstrates superior class segmentation, particularly with enhanced prediction along boundaries (see house in row 1 and chair in row 3).
Original abstract

Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
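The phrase "selects SAM masks using weak-label coverage" can be made concrete with a small sketch. The abstract does not define the coverage score or its thresholds, so the definition used here (the fraction of weak-label pixels that fall inside a candidate mask) and the names `coverage` and `select_mask` are assumptions for illustration only.

```python
def coverage(candidate, weak_pixels):
    """Fraction of weak-label pixels that fall inside the candidate mask.

    Both arguments are iterables of (row, col) tuples. This definition is an
    assumption of the sketch, not the paper's stated metric.
    """
    weak = set(weak_pixels)
    if not weak:
        return 0.0
    return len(weak & set(candidate)) / len(weak)

def select_mask(candidates, weak_pixels):
    """Return (index, score) of the candidate with the highest coverage."""
    scores = [coverage(c, weak_pixels) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical scribble and three candidate masks returned for one prompt.
scribble = [(0, 0), (0, 1), (0, 2), (0, 3)]
candidates = [
    [(0, 0), (0, 1)],                          # covers half of the scribble
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)],  # covers all of it
    [(5, 5)],                                  # misses it entirely
]
best_index, best_score = select_mask(candidates, scribble)
print(best_index, best_score)
```

Coverage alone favors large masks, so a practical rule would likely also penalize masks that spill far beyond the weak label; that part is not shown here.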

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SeSAM, a framework adapting the Segment Anything Model (SAM) for weakly supervised semantic segmentation with weak labels (coarse masks, scribbles, points). The approach decomposes class masks into connected components, samples skeleton points as prompts, selects SAM masks via weak-label coverage, and iteratively refines pseudo-labels within a semi-supervised training loop that balances ground-truth, SAM-based, and high-confidence pseudo-labels. The central claim is that this use of instance priors from SAM yields consistent outperformance over weakly supervised baselines across benchmarks while reducing annotation cost relative to full supervision.

Significance. If the pipeline reliably produces accurate class-level labels from SAM instances without systematic scale or texture biases, the work would be significant for showing how foundation-model instance priors can improve weakly supervised dense prediction. It offers a concrete, low-cost way to leverage SAM in semantic segmentation and integrates it with semi-supervised learning, which is a practical strength. The multi-benchmark, multi-annotation-type evaluation is also a positive aspect if the quantitative support is robust.

major comments (2)
  1. [Method section] Method (pipeline description): the central claim that connected-component decomposition, skeleton point sampling, coverage-based selection, and iterative refinement together convert SAM instance masks into accurate semantic labels without systematic errors is load-bearing, yet no isolated ablations, error-propagation analysis (e.g., overlap handling or class misassignment rates), or direct comparison to a weak-label-to-semantic baseline that bypasses instance decomposition are provided. Without these, gains over baselines could be artifactual rather than evidence that instance priors help.
  2. [Experiments section] Experiments section: the abstract and summary assert consistent outperformance and cost reduction, but the absence of per-component ablation tables, quantitative error analysis, or a baseline that applies the semi-supervised framework directly to weak labels (without SAM instance conversion) makes it impossible to attribute improvements specifically to the instance-prior components.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., mIoU deltas on a primary benchmark) to support the outperformance claim.
  2. [Method section] Notation for the coverage metric and the iterative refinement schedule should be defined more explicitly when first introduced to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight the importance of isolating the contributions of our instance-prior components and providing a direct baseline comparison. We agree that these elements strengthen the attribution of gains to SAM's instance priors. We have revised the manuscript to address both points.

Point-by-point responses
  1. Referee: [Method section] Method (pipeline description): the central claim that connected-component decomposition, skeleton point sampling, coverage-based selection, and iterative refinement together convert SAM instance masks into accurate semantic labels without systematic errors is load-bearing, yet no isolated ablations, error-propagation analysis (e.g., overlap handling or class misassignment rates), or direct comparison to a weak-label-to-semantic baseline that bypasses instance decomposition are provided. Without these, gains over baselines could be artifactual rather than evidence that instance priors help.

    Authors: We agree that the current manuscript lacks isolated ablations and a direct baseline comparison that applies the semi-supervised framework to weak labels without SAM instance conversion. These analyses are necessary to rigorously attribute improvements to the instance priors. In the revised manuscript, we have added a dedicated ablation study in Section 4.3 that evaluates each component (connected-component decomposition, skeleton-based sampling, coverage-based selection, and iterative refinement) in isolation, along with quantitative error analysis including overlap handling and class misassignment rates. We have also introduced a new baseline that feeds weak labels directly into the semi-supervised loop without SAM instance decomposition. Results on the benchmarks show that the full pipeline outperforms this baseline, supporting the value of the instance priors. These additions are now included in the Experiments section. revision: yes

  2. Referee: [Experiments section] Experiments section: the abstract and summary assert consistent outperformance and cost reduction, but the absence of per-component ablation tables, quantitative error analysis, or a baseline that applies the semi-supervised framework directly to weak labels (without SAM instance conversion) makes it impossible to attribute improvements specifically to the instance-prior components.

    Authors: The referee correctly notes the absence of these elements in the original submission. To enable clear attribution, the revised manuscript now contains per-component ablation tables (Table 4 and supplementary material) and quantitative error-propagation metrics. We have also added the requested baseline that applies the semi-supervised training loop directly to the weak labels, bypassing SAM instance conversion. Comparative results demonstrate that incorporating the SAM-based instance priors yields measurable gains over this baseline across annotation types, while maintaining the reported annotation cost reductions. These updates appear in the revised Experiments section and support the central claim. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with no derivations or fitted predictions

full rationale

The paper presents SeSAM as an empirical framework that adapts SAM via connected-component decomposition, skeleton sampling, coverage-based selection, and iterative pseudo-label refinement, then evaluates it through experiments on benchmarks. No equations, first-principles derivations, parameter fittings, or predictions appear in the abstract or described content. Claims rest on comparative performance results rather than any self-referential reduction of outputs to inputs. The contribution is self-contained as a practical method description and validation, with no load-bearing self-citations or ansatzes that collapse the central result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not describe any free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Jiwoon Ahn and Suha Kwak

    Accessed: 2024-11-14. Available at: https://github.com/facebookresearch/segment-anything/issues/169. Jiwoon Ahn and Suha Kwak. Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4981–4990,

  2. [2]

    SAM 3: Segment Anything with Concepts

    doi: 10.52202/079017-1463. URL https://doi.org/10.52202/079017-1463. Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment Anything with Concepts. arXiv preprint arXiv:2511.16719,

  3. [3]

    arXiv preprint arXiv:2304.08506 (2023)

    URL https://openreview.net/forum?id=HJz6tiCqYm. Chuanfei Hu, Tianyi Xia, Shenghong Ju, and Xinde Li. When Sam Meets Medical Images: An Investigation of Segment Anything Model (Sam) on Multi-phase Liver Tumor Segmentation. arXiv preprint arXiv:2304.08506,

  4. [4]

    URL https://doi.org/10.1007/978-3-031-73195-2_27

    doi: 10.1007/978-3-031-73195-2_27. URL https://doi.org/10.1007/978-3-031-73195-2_27. Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, and Fisher Yu. Matching Anything by Segmenting Anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18963–18973,

  5. [5]

    Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-shot Medical Segmentation. Diagnostics, 13(11):1947,

    Peilun Shi, Jianing Qiu, Sai Mu Dalike Abaxi, Hao Wei, Frank P-W Lo, and Wu Yuan. Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-shot Medical Segmentation. Diagnostics, 13(11):1947,

  6. [6]

    Possam: Panoptic Open-vocabulary Segment Anything. arXiv preprint arXiv:2403.09620,

    Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, and Fatih Porikli. Possam: Panoptic Open-vocabulary Segment Anything. arXiv preprint arXiv:2403.09620,

  7. [7]

    arXiv preprint arXiv:2304.11968 (2023)

    Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track Anything: Segment Anything Meets Videos. arXiv preprint arXiv:2304.11968,

  8. [8]

    (τ1, τ2) = (0.3, 0.7) provides the better trade-off for mask refinement

    As shown in Fig. 7, (τ1, τ2) = (0.3, 0.7) provides the better trade-off for mask refinement. [Panels: point prompting, bounding-box prompting, mask prompting.] Figure 8: Why points for prompting? Point prompts generate the best segmentation masks using SAM (left), whereas bounding-box prompting forces SAM to generate labels within the box and does not completely label the obj...

  9. [9]

    3.2) across datasets - Cityscapes and ADE20k

    we show robustness of our hyperparameters (Sec. 3.2) across datasets - Cityscapes and ADE20k. In particular, we ablate the number of points sampled from the coarse mask as well as the mask selection hyperparameter (τ1, τ2). We observe that sampling 5 points works … Table 9: Inconsistency between coarse and fine annotations. We perform an experiment to understand the label misma...

  10. [10]

    We observed that the model is unable to localize class ‘road’ in any of the images, and wrongly predicts it as bus (see Fig

    and train it on Cityscapes dataset. We observed that the model is unable to localize class ‘road’ in any of the images, and wrongly predicts it as bus (see Fig. 10). SAM2CAM achieves a mean IoU of 5.8% on the Cityscapes dataset, compared to 61.7% by our SeSAM using point labels with the same DeepLabv3+ segmentation network. Dependence on SAM Quality. We conduct...