CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation
Pith reviewed 2026-05-10 03:06 UTC · model grok-4.3
The pith
CoCo-SAM3 improves open-vocabulary semantic segmentation by aligning synonymous prompts and enforcing inter-class competition on a unified evidence scale, achieving consistent gains across eight benchmarks with no additional training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In open-vocabulary semantic segmentation, SAM3 generates masks independently for each prompt. The resulting masks cannot be directly compared across classes because their evidence scales differ, and synonymous descriptions of the same class activate inconsistent evidence. CoCo-SAM3 decouples inference into two steps: intra-class enhancement, which aligns and aggregates evidence from synonymous prompts to reduce drift, followed by inter-class competition, which evaluates all candidate classes on a unified comparable scale for pixel-wise decisions. This yields more stable multi-class outputs, mitigates inter-class conflicts, and delivers improvements on eight standard benchmarks.
What carries the argument
Intra-class prompt alignment that aggregates synonymous evidence, followed by inter-class competition on a unified evidence scale for direct pixel-wise comparisons.
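A minimal sketch of that two-stage pipeline, under stated assumptions: `sam3_evidence_map` is a hypothetical stand-in for SAM3's per-prompt evidence output, and both the mean aggregation over synonyms and the per-class-map rescaling are illustrative choices, not the paper's confirmed operators.

```python
import numpy as np

def coco_sam3_inference(image, synonym_sets, sam3_evidence_map):
    """Decoupled inference sketch: intra-class aggregation, then
    inter-class competition. `synonym_sets` maps each class name to a
    list of synonymous text prompts; `sam3_evidence_map(image, prompt)`
    is assumed to return an (H, W) array of mask evidence."""
    class_maps = []
    for prompts in synonym_sets.values():
        # Intra-class enhancement: pool evidence from all synonymous
        # prompts for one concept to damp wording-induced drift
        # (simple mean here; the paper's exact alignment step may differ).
        evidence = np.stack([sam3_evidence_map(image, p) for p in prompts])
        class_maps.append(evidence.mean(axis=0))

    stacked = np.stack(class_maps)  # (C, H, W), one map per class
    # Inter-class competition: rescale each class map onto a shared
    # [0, 1] range so evidence values are comparable across classes,
    # then take a pixel-wise argmax over all candidates.
    mins = stacked.min(axis=(1, 2), keepdims=True)
    maxs = stacked.max(axis=(1, 2), keepdims=True)
    unified = (stacked - mins) / (maxs - mins + 1e-8)
    return np.argmax(unified, axis=0)  # (H, W) label map
```

Because both stages touch only the per-prompt evidence maps, the sketch leaves SAM3's weights untouched, which is what makes the method training-free.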
If this is right
- Masks from different classes become directly comparable, reducing unintended overlaps in segmentation outputs.
- Consistent evidence for each concept is built from multiple prompt variations, decreasing sensitivity to wording.
- The approach requires no changes to the underlying model, allowing immediate application to pre-trained SAM3.
- Overall inference stability increases in complex scenes with multiple overlapping categories.
Where Pith is reading between the lines
- Applying similar conflict-handling steps could benefit other vision-language models that rely on independent prompt processing.
- The gains suggest that inference-time post-processing of prompts can be as effective as architectural changes for improving zero-shot performance.
- Testing on datasets with highly ambiguous class names might reveal the limits of prompt alignment.
Load-bearing premise
Aligning evidence from synonymous prompts and normalizing it to a single scale will eliminate conflicts and inconsistencies rather than masking them or creating new ones.
What would settle it
A benchmark evaluation showing that CoCo-SAM3 either fails to improve or decreases performance on one or more of the eight open-vocabulary segmentation datasets, or produces less stable outputs in multi-class test cases.
Original abstract
SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoCo-SAM3, a training-free post-processing extension to SAM3 for open-vocabulary semantic segmentation. It identifies two core problems—inter-class mask overlap due to incomparable evidence scales and intra-class drift from synonymous prompts—and addresses them by first aligning and aggregating evidence from synonymous prompts, then performing pixel-wise inter-class competition on a unified evidence scale. The central claim is that this decoupling yields consistent improvements across eight benchmarks without any additional training.
Significance. If the empirical gains hold under scrutiny, the work offers a lightweight, inference-only refinement that directly targets stability issues in prompt-driven multi-class segmentation. The explicit separation of intra-class aggregation from inter-class competition is a clean conceptual contribution that could transfer to other open-vocabulary pipelines; the absence of training or fine-tuning makes the method immediately usable on existing SAM3 checkpoints.
Major comments (2)
- [§3.2] §3.2 (Inter-class Competition): the description of the 'unified comparable scale' is given only at a high level; the precise normalization or aggregation function that makes evidence values directly comparable across classes is not stated as an equation or algorithm, making it impossible to verify whether the claimed pixel-wise competition is free of scale-induced bias.
- [Table 2] Table 2 (Main Results): the reported gains are described as 'consistent' across eight benchmarks, yet no per-benchmark standard deviations, number of runs, or statistical significance tests are provided; without these, it is difficult to assess whether the improvements are robust or could be explained by prompt-selection variance.
Minor comments (3)
- [Abstract] The abstract and §1 would benefit from a single sentence quantifying the average mIoU improvement (or range) rather than the qualitative phrase 'consistent improvements'.
- [§3.1] Notation for the intra-class aggregation step (e.g., how synonymous prompt embeddings are combined) should be introduced with a short equation even if the operation is simple averaging or max-pooling; candidate forms are sketched after this list.
- [Figure 3] Figure 3 caption should explicitly state the color mapping for the 'evidence scale' visualization so readers can interpret the before/after competition maps without referring back to the text.
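For concreteness, the two simplest candidates for the missing aggregation equation might read as follows (our notation, not the paper's): with $S_c$ the synonym prompt set for class $c$ and $e_s$ the evidence map activated by prompt $s$,

```latex
\bar{e}_c = \frac{1}{|S_c|} \sum_{s \in S_c} e_s \quad \text{(mean pooling)}
\qquad \text{or} \qquad
\bar{e}_c(p) = \max_{s \in S_c} e_s(p) \quad \text{(per-pixel max pooling)}
```

Either form would let readers check how sensitive the aggregated map is to adding or removing a single synonym.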
Simulated Author's Rebuttal
We thank the referee for the positive overall assessment and the detailed comments, which help clarify the presentation of our method. We address each major comment point by point below and will revise the manuscript to improve clarity and robustness where appropriate.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Inter-class Competition): the description of the 'unified comparable scale' is given only at a high level; the precise normalization or aggregation function that makes evidence values directly comparable across classes is not stated as an equation or algorithm, making it impossible to verify whether the claimed pixel-wise competition is free of scale-induced bias.
Authors: We agree that the current description in §3.2 remains at a conceptual level. In the revised manuscript we will add the exact normalization procedure as a formal equation together with a short algorithm box. The unified scale is obtained by first aggregating the aligned evidence from synonymous prompts per class and then applying a per-pixel min-max normalization across all class evidence maps, followed by a direct argmax competition. This formulation will be presented as Equation (3) and Algorithm 1 so that readers can verify the absence of scale-induced bias. revision: yes
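Rendered literally in our own notation (the paper's actual Equation (3) is not reproduced here), the described procedure would be: with $E_c$ the aggregated evidence map for class $c$ and $\varepsilon$ a small constant,

```latex
\tilde{E}_c(p) = \frac{E_c(p) - \min_{c'} E_{c'}(p)}
                     {\max_{c'} E_{c'}(p) - \min_{c'} E_{c'}(p) + \varepsilon},
\qquad
\hat{y}(p) = \arg\max_{c} \tilde{E}_c(p)
```

One point the revised equation should pin down: a min-max taken per pixel across classes, as written above, is monotone at each pixel and therefore cannot change the argmax; only a rescaling computed per class map over the whole image can alter the competition outcome.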
-
Referee: [Table 2] Table 2 (Main Results): the reported gains are described as 'consistent' across eight benchmarks, yet no per-benchmark standard deviations, number of runs, or statistical significance tests are provided; without these, it is difficult to assess whether the improvements are robust or could be explained by prompt-selection variance.
Authors: We acknowledge the value of statistical reporting. Because CoCo-SAM3 is a deterministic, training-free post-processing step, results are identical across repeated runs for any fixed prompt set; hence standard deviations over random seeds are zero and not reported. The term 'consistent' refers to positive gains on every one of the eight benchmarks under the same protocol. To address prompt-selection variance we will add a short sensitivity study (using three alternative synonymous prompt sets per class) and report the resulting mean and standard deviation in a revised Table 2 or supplementary table. revision: partial
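A minimal sketch of the statistic that sensitivity study would report, with illustrative placeholder numbers that are not results from the paper:

```python
import numpy as np

# Hypothetical mIoU from three alternative synonym sets per class on one
# benchmark (placeholder values for illustration, not the paper's results).
miou_per_prompt_set = np.array([41.3, 40.8, 41.6])
mean = miou_per_prompt_set.mean()
std = miou_per_prompt_set.std(ddof=1)  # sample std over prompt-set variants
print(f"mIoU = {mean:.2f} ± {std:.2f}")
```

Reporting this mean ± std per benchmark would directly address the referee's concern about prompt-selection variance.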
Circularity Check
No significant circularity detected
Full rationale
The paper introduces CoCo-SAM3 as an inference-only post-processing step on top of SAM3. It identifies issues of mask overlap and prompt inconsistency, then proposes decoupling into intra-class prompt alignment/aggregation followed by inter-class competition on a unified evidence scale. No equations, fitted parameters, or derivations are present that reduce any claimed result to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claim of consistent benchmark gains rests on the logical application of the described procedure rather than any self-referential redefinition.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Barsellotti, L., Amoroso, R., Cornia, M., Baraldi, L., Cucchiara, R.: Training-free open-vocabulary segmentation with offline diffusion-augmented prototype generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3689–3698 (2024)
- [2] Barsellotti, L., Bianchi, L., Messina, N., Carrara, F., Cornia, M., Baraldi, L., Falchi, F., Cucchiara, R.: Talking to DINO: Bridging self-supervised vision backbones with language for open-vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22025–22035 (2025)
- [3] Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., et al.: Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181 (2025)
- [4] Bousselham, W., Petersen, F., Ferrari, V., Kuehne, H.: Grounding everything: Emerging localization properties in vision-language transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3828–3837 (2024)
- [5] Bucher, M., Vu, T.H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 32 (2019)
- [6] Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and stuff classes in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1209–1218 (2018)
- [7] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
- [8] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
- [9] Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4113–4123 (2024)
- [10] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)
- [11] Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11583–11592 (2022)
- [12] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (2010)
- [13] Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling open-vocabulary image segmentation with image-level labels. In: European Conference on Computer Vision. pp. 540–557. Springer (2022)
- [14] Hajimiri, S., Ben Ayed, I., Dolz, J.: Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In: Proceedings of the Winter Conference on Applications of Computer Vision. pp. 5061–5071 (2025)
- [15] Huang, Y., Kang, D., Chen, L., Zhe, X., Jia, W., Bao, L., He, X.: CAR: Class-aware regularizations for semantic segmentation. In: European Conference on Computer Vision. pp. 518–534. Springer (2022)
- [16] Jin, S., Yu, S., Zhang, B., Sun, M., Dong, Y., Xiao, J.: Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20291–20300 (2025)
- [17] Kim, C., Ju, D., Han, W., Yang, M.H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15033–15042 (2025)
- [18] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)
- [19] Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: ClearCLIP: Decomposing CLIP representations for dense vision-language inference. In: European Conference on Computer Vision. pp. 143–160. Springer (2024)
- [20] Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation. In: European Conference on Computer Vision. pp. 70–88. Springer (2024)
- [21] Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022)
- [22] Li, J., Lu, Y., Zhang, Y., Xie, Y., Wang, F., Xie, Y., Qu, Y.: Target refocusing via attention redistribution for open-vocabulary semantic segmentation: An explainability perspective. arXiv preprint arXiv:2511.16170 (2025)
- [23] Li, Y., Wang, H., Duan, Y., Li, X.: CLIP surgery for better explainability with enhancement in open-vocabulary tasks. arXiv e-prints, arXiv–2304 (2023)
- [24] Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7061–7070 (2023)
- [25] Lu, L., Chen, X., Guo, M., Li, S., Wang, J., Shi, Y.: ChordEdit: One-step low-energy transport for image editing. arXiv preprint arXiv:2602.19083 (2026)
- [26] Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 891–898 (2014)
- [27] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
- [28] Pei, G., Jiang, X., Yao, Y., Shu, X., Shen, F., Jeon, B.: Taming SAM3 in the wild: A concept bank for open-vocabulary segmentation. arXiv preprint arXiv:2602.06333 (2026)
- [29] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
- [30] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
- [31] Ru, L., Zheng, H., Zhan, Y., Du, B.: Token contrast for weakly-supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3093–3102 (2023)
- [32] Shao, T., Tian, Z., Zhao, H., Su, J.: Explore the potential of CLIP for training-free open vocabulary semantic segmentation. In: European Conference on Computer Vision. pp. 139–156. Springer (2024)
- [33] Shi, Y., Xie, Y., Guo, M., Lu, L., Huang, M., Wang, J., Zhu, Z., Xu, B., Huang, Z.: MMErroR: A benchmark for erroneous reasoning in vision-language models. arXiv preprint arXiv:2601.03331 (2026)
- [34] Shi, Y., Dong, M., Xu, C.: Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23487–23497 (2025)
- [35] Wang, F., Mei, J., Yuille, A.: SCLIP: Rethinking self-attention for dense vision-language inference. In: European Conference on Computer Vision. pp. 315–332. Springer (2024)
- [36] Wysoczańska, M., Siméoni, O., Ramamonjisoa, M., Bursuc, A., Trzciński, T., Pérez, P.: CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. In: European Conference on Computer Vision. pp. 320–337. Springer (2024)
- [37] Xian, Y., Choudhury, S., He, Y., Schiele, B., Akata, Z.: Semantic projection network for zero- and few-label semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8256–8265 (2019)
- [38] Xie, B., Cao, J., Xie, J., Khan, F.S., Pang, Y.: SED: A simple encoder-decoder for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3426–3436 (2024)
- [39] Xu, J., De Mello, S., Liu, S., Byeon, W., Breuel, T., Kautz, J., Wang, X.: GroupViT: Semantic segmentation emerges from text supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18134–18144 (2022)
- [40] Xu, X., Xiong, T., Ding, Z., Tu, Z.: MasQCLIP for open-vocabulary universal image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 887–898 (2023)
- [41] Xuan, X., Deng, Z., Ma, K.L.: ReME: A data-centric framework for training-free open-vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20954–20965 (2025)
- [42] Zhang, D., Liu, F., Tang, Q.: CorrCLIP: Reconstructing patch correlations in CLIP for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 24677–24687 (2025)
- [43] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 633–641 (2017)
- [44] Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from CLIP. In: European Conference on Computer Vision. pp. 696–712. Springer (2022)

Appendix A, Implementation Details (fragment): For each category, we construct a prompt set consisting of the canonical class name and several synonymous expressions. The synonym set is used only to build the sem...