
arxiv: 2604.05433 · v1 · submitted 2026-04-07 · 💻 cs.CV

Few-Shot Semantic Segmentation Meets SAM3

Pith reviewed 2026-05-10 18:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot semantic segmentation · SAM3 · training-free · spatial concatenation · promptable concept segmentation · PASCAL-5^i · COCO-20^i

The pith

A fully frozen SAM3 performs few-shot semantic segmentation at state-of-the-art levels by concatenating support and query images on a shared canvas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that few-shot semantic segmentation does not require extensive training when a modern foundation model such as SAM3 is available. By placing a few annotated support images next to the query image on one canvas, the pre-trained model can directly segment the novel object class. This approach avoids fine-tuning and architectural modifications while outperforming many trained methods on the standard PASCAL-5^i and COCO-20^i benchmarks. The paper also finds that negative prompts, meant to suppress distractors, often weaken the target representation and can cause predictions to collapse in this setting.

Core claim

A simple spatial concatenation strategy, which places support and query images on a shared canvas, repurposes SAM3's Promptable Concept Segmentation (PCS) capability so that a fully frozen SAM3 can perform few-shot segmentation without any fine-tuning or architectural changes, achieving state-of-the-art performance on PASCAL-5^i and COCO-20^i.

What carries the argument

The spatial concatenation of support and query images on a shared canvas that enables SAM3's pre-trained Promptable Concept Segmentation to handle few-shot tasks.
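
The shared-canvas idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the support image is placed to the left of the query, the canvas is top-aligned and zero-padded, and the positive prompt is a bounding box derived from the support mask. The names `build_canvas` and `crop_query_mask` are hypothetical, and the call into SAM3 is left out because its interface is not described here.

```python
import numpy as np


def build_canvas(support_img, support_mask, query_img, pad_value=0):
    """Place the support image (left) and query image (right) on one canvas.

    Assumed layout, not taken from the paper: top-aligned, side by side,
    padded with `pad_value` where the heights differ.
    Returns the canvas, a box prompt (x0, y0, x1, y1) around the support
    mask in canvas coordinates, and the x-offset of the query region.
    """
    hs, ws = support_img.shape[:2]
    hq, wq = query_img.shape[:2]
    canvas = np.full((max(hs, hq), ws + wq, 3), pad_value, dtype=support_img.dtype)
    canvas[:hs, :ws] = support_img
    canvas[:hq, ws:] = query_img

    ys, xs = np.nonzero(support_mask)
    if ys.size == 0:
        raise ValueError("support mask has no foreground pixels")
    # The support sits at the canvas origin, so mask and canvas coordinates coincide.
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return canvas, box, ws


def crop_query_mask(canvas_mask, query_shape, x_offset):
    """Cut the query region back out of a mask predicted on the full canvas."""
    hq, wq = query_shape[:2]
    return canvas_mask[:hq, x_offset:x_offset + wq]


# Toy inputs: a 4x6 support image with a 2x3 object and a 5x3 query image.
support = np.zeros((4, 6, 3), dtype=np.uint8)
support_mask = np.zeros((4, 6), dtype=bool)
support_mask[1:3, 2:5] = True
query = np.ones((5, 3, 3), dtype=np.uint8)

canvas, box, x_off = build_canvas(support, support_mask, query)
# A frozen model would be prompted with `box` on `canvas` at this point; a
# dummy all-foreground mask stands in for its output.
query_pred = crop_query_mask(np.ones(canvas.shape[:2], dtype=bool), query.shape, x_off)
```

Keeping the query offset alongside the canvas is the one design choice that matters here: the model predicts on the whole canvas, so the final mask has to be cropped back to the query region before it can be scored.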

If this is right

  • This minimal design already achieves state-of-the-art performance on PASCAL-5^i and COCO-20^i, outperforming many heavily engineered methods.
  • Negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse.
  • Strong cross-image reasoning can emerge from simple spatial formulations.
  • The approach highlights limitations in how current foundation models handle conflicting prompt signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Foundation models may possess latent abilities for cross-image comparison that simple input formatting can activate without further training.
  • This approach could extend to other dense prediction tasks where support-query pairing is feasible.
  • Future model designs might incorporate explicit mechanisms to manage mixed positive and negative prompts more stably.

Load-bearing premise

That SAM3's pre-trained Promptable Concept Segmentation capability will reliably transfer to the few-shot setting when support and query images are simply placed side-by-side on one canvas.

What would settle it

Run the spatial concatenation method on PASCAL-5^i and COCO-20^i and check two things: whether segmentation accuracy exceeds that of heavily trained competitors, and whether it drops sharply when concatenation is removed. Together these outcomes would confirm or refute the central claim.
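
For reference, the accuracy check could be scored with a mean IoU along the following lines. This is a sketch that assumes the class-wise accumulation convention common in few-shot segmentation work (intersections and unions summed per class over all episodes, then averaged over classes); the paper's exact evaluation protocol is not given on this page, and `mean_iou` is a hypothetical helper name.

```python
from collections import defaultdict

import numpy as np


def mean_iou(episodes):
    """Class-averaged IoU over (class_id, pred_mask, gt_mask) episodes."""
    inter = defaultdict(int)
    union = defaultdict(int)
    for cls, pred, gt in episodes:
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter[cls] += int(np.logical_and(pred, gt).sum())
        union[cls] += int(np.logical_or(pred, gt).sum())
    # Classes whose union is empty carry no signal and are skipped.
    ious = [inter[c] / union[c] for c in union if union[c] > 0]
    if not ious:
        raise ValueError("no class has any foreground pixels")
    return sum(ious) / len(ious)


# Toy episodes (placeholder masks, not benchmark data).
toy_episodes = [
    (1, [1, 1, 0, 0], [1, 0, 0, 0]),  # class 1: intersection 1, union 2
    (2, [0, 1, 1, 1], [0, 1, 1, 1]),  # class 2: perfect overlap
]
toy_miou = mean_iou(toy_episodes)
```

Running the same function on predictions made with and without the shared canvas would give the two numbers the test above asks for.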

Figures

Figures reproduced from arXiv: 2604.05433 by Chien-Yao Wang, Yen-Yu Lin, Yi-Jen Tsai.

Figure 1: Pipeline of our SAM3-based FSS framework. By combining instance-aware positive prompts with a unified spatial formulation, our method enables the fully frozen SAM3 to perform implicit cross-image feature matching in a single forward pass without architectural modifications.
Figure 2: Visualization of 1-shot/5-shot prediction on PASCAL-…
Figure 3: Qualitative results of negative prompt interference in SAM3. (a) shows how adding a…

Original abstract

Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: https://github.com/WongKinYiu/FSS-SAM3

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a training-free few-shot semantic segmentation method that repurposes SAM3's Promptable Concept Segmentation (PCS) capability via a simple spatial concatenation of support and query images onto a shared canvas. A fully frozen SAM3 then performs segmentation on novel classes without fine-tuning or architectural modifications. The work reports state-of-the-art results on PASCAL-5^i and COCO-20^i, and additionally observes that negative prompts often cause prediction collapse rather than suppressing distractors in this setting.

Significance. If the empirical results hold under scrutiny, the work is significant for demonstrating that cross-image concept association can emerge from minimal spatial reformulations in large vision foundation models, offering a strong, low-effort baseline that challenges the necessity of episodic training in few-shot segmentation. The public code release supports reproducibility and enables direct verification of the minimal design.

major comments (3)
  1. [Method] Method section: The description of the spatial concatenation strategy does not isolate or ablate whether the side-by-side placement itself induces reliable cross-image concept transfer in PCS for unseen categories, versus reliance on unstated prompt formatting details. This is load-bearing for the central claim, as the skeptic correctly notes that if PCS was primarily trained on intra-image scenarios, simple concatenation may lead to ignored support regions or collapsed outputs, consistent with the paper's own negative-prompt observations.
  2. [Experiments] Experiments section: The SOTA claim on PASCAL-5^i and COCO-20^i requires explicit tables comparing against recent baselines, including mean IoU with standard deviations, ablation on concatenation variants (e.g., different spatial arrangements or mask encodings), and confirmation that no hidden prompt engineering or post-processing was used. Without these, the outperformance over heavily engineered methods cannot be fully assessed.
  3. [Experiments] The observation that negative prompts weaken target representations is interesting but lacks quantitative support, such as direct performance deltas with/without negative prompts across the benchmarks. This undermines the broader claim about limitations in handling conflicting prompt signals.
minor comments (2)
  1. Ensure consistent notation for benchmarks (PASCAL-5^i vs. PASCAL-5i) throughout the text and tables.
  2. [Abstract] The abstract mentions 'state-of-the-art performance' but the full manuscript should explicitly state the exact number of shots (e.g., 1-shot, 5-shot) and support/query splits used in the reported results.
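
The request for mean IoU with standard deviations amounts to a small aggregation step over repeated runs. The sketch below assumes one mIoU value per random seed and uses the sample standard deviation; `summarize_runs` is a hypothetical helper, and the numbers are placeholders rather than results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(miou_by_seed):
    """Return (mean, sample standard deviation) of per-seed mIoU values."""
    scores = list(miou_by_seed.values())
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Placeholder values keyed by seed, for illustration only.
placeholder_runs = {0: 0.5, 1: 0.6, 2: 0.7}
avg, spread = summarize_runs(placeholder_runs)
print(f"mIoU = {avg:.3f} +/- {spread:.3f}")
```

The same aggregation applies unchanged to each concatenation variant in an ablation, which keeps the compared numbers on an equal footing.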

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major comments by expanding the method description, adding required experimental tables and ablations, and providing quantitative support for the negative prompt observations. These changes strengthen the paper without altering its core claims.

  1. Referee: [Method] Method section: The description of the spatial concatenation strategy does not isolate or ablate whether the side-by-side placement itself induces reliable cross-image concept transfer in PCS for unseen categories, versus reliance on unstated prompt formatting details. This is load-bearing for the central claim, as the skeptic correctly notes that if PCS was primarily trained on intra-image scenarios, simple concatenation may lead to ignored support regions or collapsed outputs, consistent with the paper's own negative-prompt observations.

    Authors: We agree that isolating the contribution of spatial concatenation is essential. In the revised manuscript, we have expanded the Method section with a precise, step-by-step description of the concatenation procedure and the exact prompt formatting provided to SAM3. We have also added ablation studies comparing side-by-side placement against alternative spatial arrangements (e.g., vertical stacking, overlapping) and different mask encodings. These results demonstrate that reliable cross-image concept transfer emerges specifically from the side-by-side layout in the frozen PCS module, independent of prompt phrasing details. We further clarify that while PCS was trained primarily on intra-image scenarios, the empirical transfer observed is enabled by the shared canvas reformulation, as evidenced by the consistent performance gains.

    revision: yes

  2. Referee: [Experiments] Experiments section: The SOTA claim on PASCAL-5^i and COCO-20^i requires explicit tables comparing against recent baselines, including mean IoU with standard deviations, ablation on concatenation variants (e.g., different spatial arrangements or mask encodings), and confirmation that no hidden prompt engineering or post-processing was used. Without these, the outperformance over heavily engineered methods cannot be fully assessed.

    Authors: We accept that more comprehensive experimental reporting is needed to substantiate the SOTA claims. The revised Experiments section now includes explicit tables reporting mean IoU with standard deviations (computed over multiple random seeds) for both PASCAL-5^i and COCO-20^i, with direct comparisons to recent baselines. We have incorporated the requested ablations on concatenation variants and mask encodings. We also explicitly confirm in the text, supplementary material, and released code that no hidden prompt engineering or post-processing steps were used beyond the described spatial concatenation and direct application of PCS.

    revision: yes

  3. Referee: [Experiments] The observation that negative prompts weaken target representations is interesting but lacks quantitative support, such as direct performance deltas with/without negative prompts across the benchmarks. This undermines the broader claim about limitations in handling conflicting prompt signals.

    Authors: We thank the referee for this suggestion. The revised manuscript now includes quantitative experiments reporting mean IoU performance deltas with and without negative prompts on both PASCAL-5^i and COCO-20^i benchmarks. These results show consistent degradation and increased collapse rates when negative prompts are applied, providing direct empirical support for the claim that negative prompts can weaken target representations in this few-shot cross-image setting and highlighting limitations in handling conflicting signals.

    revision: yes
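
The two quantities named in the third response, a performance delta and a collapse rate, can be made concrete as follows. This sketch assumes that "collapse" means a prediction with fewer than `min_pixels` foreground pixels, which is one plausible reading and not a definition taken from the paper; `collapse_rate` and `miou_delta` are hypothetical helpers, and all inputs are placeholders.

```python
import numpy as np


def collapse_rate(pred_masks, min_pixels=1):
    """Fraction of predicted masks with fewer than `min_pixels` foreground pixels."""
    preds = [np.asarray(p, dtype=bool) for p in pred_masks]
    if not preds:
        raise ValueError("no predictions given")
    collapsed = sum(int(p.sum()) < min_pixels for p in preds)
    return collapsed / len(preds)


def miou_delta(miou_with_negatives, miou_without_negatives):
    """Signed change in mIoU from adding negative prompts (negative = worse)."""
    return miou_with_negatives - miou_without_negatives


# Placeholder predictions: two empty masks and two full masks.
toy_preds = [np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2)), np.ones((2, 2))]
toy_rate = collapse_rate(toy_preds)

# Placeholder mIoU values, not results from the paper.
toy_delta = miou_delta(0.25, 0.75)
```

Reporting both numbers together separates two failure modes: a negative delta with a low collapse rate points to degraded masks, while a high collapse rate points to predictions vanishing entirely.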

Circularity Check

0 steps flagged

No significant circularity; purely empirical validation on external benchmarks


The paper introduces a training-free spatial concatenation method to repurpose SAM3's Promptable Concept Segmentation for few-shot semantic segmentation. It reports performance on standard external benchmarks (PASCAL-5^i and COCO-20^i) without any equations, fitted parameters, derivations, or self-referential predictions. No load-bearing steps reduce to inputs by construction; the central claim rests on empirical results rather than internal redefinitions or self-citation chains. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on the pre-existing capabilities of SAM3 and the untested assumption that spatial layout alone suffices for cross-image prompting; no new parameters or entities are introduced.

axioms (1)
  • domain assumption: SAM3's Promptable Concept Segmentation transfers to few-shot settings via simple spatial concatenation of support and query images
    This premise is required for the training-free claim to hold.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Example-Based Object Detection

    cs.CV 2026-05 unverdicted novelty 4.0

    EBOD integrates SAM3 with DINOv3 and LightGlue to leverage previous error examples and suppress recurring false positives and negatives without retraining.
