Few-Shot Semantic Segmentation Meets SAM3
Pith reviewed 2026-05-10 18:56 UTC · model grok-4.3
The pith
A fully frozen SAM3 performs few-shot semantic segmentation at state-of-the-art levels by concatenating support and query images on a shared canvas.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A simple spatial concatenation strategy that places support and query images on a shared canvas repurposes SAM3's Promptable Concept Segmentation capability. It allows a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes, achieving state-of-the-art performance on PASCAL-5^i and COCO-20^i.
What carries the argument
The spatial concatenation of support and query images on a shared canvas that enables SAM3's pre-trained Promptable Concept Segmentation to handle few-shot tasks.
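The shared-canvas idea can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `build_canvas`, `segment_query`, and the `segment_fn` callable are hypothetical names, the left/right layout with zero padding is only one possible arrangement, and using the support mask's bounding box as the visual prompt is an assumption. `segment_fn` stands in for a frozen promptable segmenter; no real SAM3 API is implied.

```python
import numpy as np


def build_canvas(support_img, support_mask, query_img):
    """Place the support image (left) and query image (right) on one canvas.

    Returns the canvas, the support object's bounding box in canvas
    coordinates as (x0, y0, x1, y1), and the column where the query starts.
    """
    hs, ws = support_img.shape[:2]
    hq, wq = query_img.shape[:2]
    # Zero-pad the shorter image at the bottom (assumed padding scheme).
    canvas = np.zeros((max(hs, hq), ws + wq, 3), dtype=support_img.dtype)
    canvas[:hs, :ws] = support_img
    canvas[:hq, ws:] = query_img

    ys, xs = np.nonzero(support_mask)
    if ys.size == 0:
        raise ValueError("support mask is empty")
    # The support image sits at the canvas origin, so no offset is needed.
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return canvas, box, ws


def segment_query(support_img, support_mask, query_img, segment_fn):
    """Run a promptable segmenter on the canvas and crop out the query part.

    segment_fn(canvas, box) is a hypothetical callable that returns a
    boolean mask with the same height and width as the canvas.
    """
    canvas, box, x_off = build_canvas(support_img, support_mask, query_img)
    canvas_mask = np.asarray(segment_fn(canvas, box), dtype=bool)
    hq, wq = query_img.shape[:2]
    return canvas_mask[:hq, x_off:x_off + wq]
```

Any callable that maps a canvas and a box to a boolean mask of the canvas size can be passed as `segment_fn`. Cropping the right-hand region back out is what turns a single-image prediction into a query prediction.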
If this is right
- This minimal design already achieves state-of-the-art performance on PASCAL-5^i and COCO-20^i, outperforming many heavily engineered methods.
- Negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse.
- Strong cross-image reasoning can emerge from simple spatial formulations.
- The approach highlights limitations in how current foundation models handle conflicting prompt signals.
Where Pith is reading between the lines
- Foundation models may possess latent abilities for cross-image comparison that simple input formatting can activate without further training.
- This approach could extend to other dense prediction tasks where support-query pairing is feasible.
- Future model designs might incorporate explicit mechanisms to manage mixed positive and negative prompts more stably.
Load-bearing premise
That SAM3's pre-trained Promptable Concept Segmentation capability will reliably transfer to the few-shot setting when support and query images are simply placed side-by-side on one canvas.
What would settle it
Running the spatial concatenation method on PASCAL-5^i and COCO-20^i and checking whether segmentation accuracy exceeds that of heavily trained competitors or drops sharply when concatenation is removed would confirm or refute the central claim.
Original abstract
Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: https://github.com/WongKinYiu/FSS-SAM3
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a training-free few-shot semantic segmentation method that repurposes SAM3's Promptable Concept Segmentation (PCS) capability via a simple spatial concatenation of support and query images onto a shared canvas. A fully frozen SAM3 then performs segmentation on novel classes without fine-tuning or architectural modifications. The work reports state-of-the-art results on PASCAL-5^i and COCO-20^i, and additionally observes that negative prompts often cause prediction collapse rather than suppressing distractors in this setting.
Significance. If the empirical results hold under scrutiny, the work is significant for demonstrating that cross-image concept association can emerge from minimal spatial reformulations in large vision foundation models, offering a strong, low-effort baseline that challenges the necessity of episodic training in few-shot segmentation. The public code release supports reproducibility and enables direct verification of the minimal design.
major comments (3)
- [Method] Method section: The description of the spatial concatenation strategy does not isolate or ablate whether the side-by-side placement itself induces reliable cross-image concept transfer in PCS for unseen categories, versus reliance on unstated prompt formatting details. This is load-bearing for the central claim: if PCS was trained primarily on intra-image scenarios, simple concatenation may lead to ignored support regions or collapsed outputs, which would be consistent with the paper's own negative-prompt observations.
- [Experiments] Experiments section: The SOTA claim on PASCAL-5^i and COCO-20^i requires explicit tables comparing against recent baselines, including mean IoU with standard deviations, ablation on concatenation variants (e.g., different spatial arrangements or mask encodings), and confirmation that no hidden prompt engineering or post-processing was used. Without these, the outperformance over heavily engineered methods cannot be fully assessed.
- [Experiments] The observation that negative prompts weaken target representations is interesting but lacks quantitative support, such as direct performance deltas with/without negative prompts across the benchmarks. This undermines the broader claim about limitations in handling conflicting prompt signals.
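The reporting requested in the major comments (mean IoU with standard deviations) can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code: `fold_miou` and `summarize` are hypothetical helpers, and accumulating intersection and union per class before averaging is a common convention in the few-shot segmentation literature that the paper is only assumed to follow.

```python
from collections import defaultdict
from statistics import mean, stdev


def fold_miou(episodes):
    """Mean IoU (in percent) over the classes of one fold.

    episodes: iterable of (class_id, intersection_px, union_px) tuples,
    one per evaluated query image.
    """
    inter = defaultdict(int)
    union = defaultdict(int)
    for cls, i, u in episodes:
        inter[cls] += i
        union[cls] += u
    # Per-class IoU from accumulated pixel counts, then average over classes.
    per_class = [inter[c] / union[c] for c in union if union[c] > 0]
    if not per_class:
        raise ValueError("no class with a non-empty union")
    return 100.0 * mean(per_class)


def summarize(runs):
    """Mean and sample standard deviation of mIoU across random seeds."""
    runs = list(runs)
    return mean(runs), (stdev(runs) if len(runs) > 1 else 0.0)
```

Running `fold_miou` once per seed and passing the resulting values to `summarize` yields the "mean ± std" entries a comparison table would need.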
minor comments (2)
- Ensure consistent notation for benchmarks (PASCAL-5^i vs. PASCAL-5i) throughout the text and tables.
- [Abstract] The abstract mentions 'state-of-the-art performance' but the full manuscript should explicitly state the exact number of shots (e.g., 1-shot, 5-shot) and support/query splits used in the reported results.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major comments by expanding the method description, adding required experimental tables and ablations, and providing quantitative support for the negative prompt observations. These changes strengthen the paper without altering its core claims.
-
Referee: [Method] Method section: The description of the spatial concatenation strategy does not isolate or ablate whether the side-by-side placement itself induces reliable cross-image concept transfer in PCS for unseen categories, versus reliance on unstated prompt formatting details. This is load-bearing for the central claim: if PCS was trained primarily on intra-image scenarios, simple concatenation may lead to ignored support regions or collapsed outputs, which would be consistent with the paper's own negative-prompt observations.
Authors: We agree that isolating the contribution of spatial concatenation is essential. In the revised manuscript, we have expanded the Method section with a precise, step-by-step description of the concatenation procedure and the exact prompt formatting provided to SAM3. We have also added ablation studies comparing side-by-side placement against alternative spatial arrangements (e.g., vertical stacking, overlapping) and different mask encodings. These results demonstrate that reliable cross-image concept transfer emerges specifically from the side-by-side layout in the frozen PCS module, independent of prompt phrasing details. We further clarify that while PCS was trained primarily on intra-image scenarios, the empirical transfer observed is enabled by the shared canvas reformulation, as evidenced by the consistent performance gains.
Revision: yes
-
Referee: [Experiments] Experiments section: The SOTA claim on PASCAL-5^i and COCO-20^i requires explicit tables comparing against recent baselines, including mean IoU with standard deviations, ablation on concatenation variants (e.g., different spatial arrangements or mask encodings), and confirmation that no hidden prompt engineering or post-processing was used. Without these, the outperformance over heavily engineered methods cannot be fully assessed.
Authors: We accept that more comprehensive experimental reporting is needed to substantiate the SOTA claims. The revised Experiments section now includes explicit tables reporting mean IoU with standard deviations (computed over multiple random seeds) for both PASCAL-5^i and COCO-20^i, with direct comparisons to recent baselines. We have incorporated the requested ablations on concatenation variants and mask encodings. We also explicitly confirm in the text, supplementary material, and released code that no hidden prompt engineering or post-processing steps were used beyond the described spatial concatenation and direct application of PCS.
Revision: yes
-
Referee: [Experiments] The observation that negative prompts weaken target representations is interesting but lacks quantitative support, such as direct performance deltas with/without negative prompts across the benchmarks. This undermines the broader claim about limitations in handling conflicting prompt signals.
Authors: We thank the referee for this suggestion. The revised manuscript now includes quantitative experiments reporting mean IoU performance deltas with and without negative prompts on both PASCAL-5^i and COCO-20^i benchmarks. These results show consistent degradation and increased collapse rates when negative prompts are applied, providing direct empirical support for the claim that negative prompts can weaken target representations in this few-shot cross-image setting and highlighting limitations in handling conflicting signals.
Revision: yes
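The negative-prompt comparison discussed in the last response can be quantified with a small helper like the one below. It is a sketch under assumptions: `collapse_rate` and `negative_prompt_delta` are hypothetical names, and "collapse" is taken here to mean a prediction with zero foreground pixels, which may differ from the definition used in the paper.

```python
def collapse_rate(pred_foreground_px):
    """Fraction of episodes whose predicted mask is empty.

    pred_foreground_px: iterable with the number of predicted foreground
    pixels per episode. Treating an empty mask as a collapse is an
    assumed definition.
    """
    counts = list(pred_foreground_px)
    if not counts:
        raise ValueError("no episodes given")
    return sum(1 for n in counts if n == 0) / len(counts)


def negative_prompt_delta(miou_without, miou_with, px_without, px_with):
    """Compare a run without negative prompts against a run with them."""
    return {
        # Negative values mean that negative prompts hurt mIoU.
        "miou_delta": miou_with - miou_without,
        "collapse_rate_without": collapse_rate(px_without),
        "collapse_rate_with": collapse_rate(px_with),
    }
```

Reporting the mIoU delta alongside both collapse rates separates a general loss of accuracy from outright empty predictions.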
Circularity Check
No significant circularity; purely empirical validation on external benchmarks
The paper introduces a training-free spatial concatenation method to repurpose SAM3's Promptable Concept Segmentation for few-shot semantic segmentation. It reports performance on standard external benchmarks (PASCAL-5^i and COCO-20^i) without any equations, fitted parameters, derivations, or self-referential predictions. No load-bearing steps reduce to inputs by construction; the central claim rests on empirical results rather than internal redefinitions or self-citation chains. The evaluation is self-contained and checked against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SAM3's Promptable Concept Segmentation transfers to few-shot settings via simple spatial concatenation of support and query images
Forward citations
Cited by 1 Pith paper
-
Example-Based Object Detection
EBOD integrates SAM3 with DINOv3 and LightGlue to leverage previous error examples and suppress recurring false positives and negatives without retraining.