pith. machine review for the scientific record. sign in

arxiv: 2604.19648 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 03:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords open-vocabulary semantic segmentationconcept conflictprompt alignmentinter-class competitionSAM3mask generationzero-shot inference
0
0 comments X

The pith

CoCo-SAM3 improves open-vocabulary semantic segmentation by aligning synonymous prompts and enforcing inter-class competition on a unified evidence scale, achieving consistent gains across eight benchmarks with no additional training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that independent mask generation from different category prompts in SAM3 creates problems in multi-class scenarios because the evidence lacks a shared scale for fair comparison, leading to overlapping masks and unstable decisions. It further identifies that synonymous prompts for the same concept activate inconsistent spatial and semantic evidence, worsening the conflicts. CoCo-SAM3 counters this by first combining evidence from synonymous prompts to build consistent concepts and then running competition across classes using normalized scales for direct pixel comparisons. A sympathetic reader would care because this turns an existing powerful model into a more reliable one for real-world scenes with many objects without needing new data or compute for training.

Core claim

In open-vocabulary semantic segmentation, the SAM3 model generates masks independently for each prompt, but this results in masks that cannot be directly compared across classes due to differing evidence scales and inconsistent activation for synonymous descriptions of the same class. CoCo-SAM3 decouples the inference into two steps: intra-class enhancement through alignment and aggregation of synonymous prompt evidence to reduce drift, followed by inter-class competition where all candidate classes are evaluated on a unified comparable scale for pixel-wise decisions. This produces more stable multi-class outputs and mitigates conflicts, delivering improvements on eight standard benchmarks.

What carries the argument

Intra-class prompt alignment that aggregates synonymous evidence followed by inter-class competition on a unified evidence scale for direct pixel-wise comparisons.

If this is right

  • Masks from different classes become directly comparable, reducing unintended overlaps in segmentation outputs.
  • Consistent evidence for each concept is built from multiple prompt variations, decreasing sensitivity to wording.
  • The approach requires no changes to the underlying model, allowing immediate application to pre-trained SAM3.
  • Overall inference stability increases in complex scenes with multiple overlapping categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying similar conflict-handling steps could benefit other vision-language models that rely on independent prompt processing.
  • The gains suggest that inference-time post-processing of prompts can be as effective as architectural changes for improving zero-shot performance.
  • Testing on datasets with highly ambiguous class names might reveal the limits of prompt alignment.

Load-bearing premise

Aligning evidence from synonymous prompts and normalizing it to a single scale will eliminate conflicts and inconsistencies rather than masking them or creating new ones.

What would settle it

A benchmark evaluation showing that CoCo-SAM3 either fails to improve or decreases performance on one or more of the eight open-vocabulary segmentation datasets, or produces less stable outputs in multi-class test cases.

Figures

Figures reproduced from arXiv: 2604.19648 by Baoyao Yang, Jingchao Wang, Siqi Liu, Yanhui Chen.

Figure 1
Figure 1. Figure 1: Left: Inter-class conflicts when applying SAM3 to open-vocabulary semantic segmentation. Masks for different categories are generated independently from their respective prompts, without evidence calibration on a unified scale, resulting in mu￾tual overwriting and confusion. Right: Controlled inter-class competition analysis on COCO-S. We vary the ratio of inter-class competitors p in the semantic-prior no… view at source ↗
Figure 2
Figure 2. Figure 2: The overview of CoCo-SAM3. We enhance semantic-evidence consistency via intra-class synonym aggregation and build a unified-scale conflict prior to stabilize inter-class competition, yielding stable open-vocabulary semantic segmentation. 4.2 Synonym Aggregation In multi-class open-vocabulary inference, whether inter-class conflicts can be stably suppressed often hinges on whether intra-class evidence can f… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparisons. “GT” denotes the ground truth. ProxyCLIP [20], CASS [17], CorrCLIP [42], Trident [34], ReME [41]). With￾out any additional training, CoCo-SAM3 not only significantly outperforms the vanilla SAM3 [7] framework (57.5, +6.8), but also surpasses the strongest CLIP-VFM baselines ReME (55.2, +9.1) and CorrCLIP (53.6, +10.7). Further analysis by evaluation protocol shows that under the wi… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of mIoU using different PE layers (#3/#12/#18/#25/#31) [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Additional qualitative results of our method on PC59. Image CorrCLIP SAM3 Ours GT [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional qualitative results of our method on COCO-S [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Additional qualitative results of our method on VOC21. Image CorrCLIP SAM3 Ours GT [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Additional qualitative results of our method on Cityscapes [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Additional qualitative results of our method on ADE20K [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
read the original abstract

SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes CoCo-SAM3, a training-free post-processing extension to SAM3 for open-vocabulary semantic segmentation. It identifies two core problems—inter-class mask overlap due to incomparable evidence scales and intra-class drift from synonymous prompts—and addresses them by first aligning and aggregating evidence from synonymous prompts, then performing pixel-wise inter-class competition on a unified evidence scale. The central claim is that this decoupling yields consistent improvements across eight benchmarks without any additional training.

Significance. If the empirical gains hold under scrutiny, the work offers a lightweight, inference-only refinement that directly targets stability issues in prompt-driven multi-class segmentation. The explicit separation of intra-class aggregation from inter-class competition is a clean conceptual contribution that could transfer to other open-vocabulary pipelines; the absence of training or fine-tuning makes the method immediately usable on existing SAM3 checkpoints.

major comments (2)
  1. [§3.2] §3.2 (Inter-class Competition): the description of the 'unified comparable scale' is given only at a high level; the precise normalization or aggregation function that makes evidence values directly comparable across classes is not stated as an equation or algorithm, making it impossible to verify whether the claimed pixel-wise competition is free of scale-induced bias.
  2. [Table 2] Table 2 (Main Results): the reported gains are described as 'consistent' across eight benchmarks, yet no per-benchmark standard deviations, number of runs, or statistical significance tests are provided; without these, it is difficult to assess whether the improvements are robust or could be explained by prompt-selection variance.
minor comments (3)
  1. [Abstract] The abstract and §1 would benefit from a single sentence quantifying the average mIoU improvement (or range) rather than the qualitative phrase 'consistent improvements'.
  2. [§3.1] Notation for the intra-class aggregation step (e.g., how synonymous prompt embeddings are combined) should be introduced with a short equation even if the operation is simple averaging or max-pooling.
  3. [Figure 3] Figure 3 caption should explicitly state the color mapping for the 'evidence scale' visualization so readers can interpret the before/after competition maps without referring back to the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive overall assessment and the detailed comments, which help clarify the presentation of our method. We address each major comment point by point below and will revise the manuscript to improve clarity and robustness where appropriate.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Inter-class Competition): the description of the 'unified comparable scale' is given only at a high level; the precise normalization or aggregation function that makes evidence values directly comparable across classes is not stated as an equation or algorithm, making it impossible to verify whether the claimed pixel-wise competition is free of scale-induced bias.

    Authors: We agree that the current description in §3.2 remains at a conceptual level. In the revised manuscript we will add the exact normalization procedure as a formal equation together with a short algorithm box. The unified scale is obtained by first aggregating the aligned evidence from synonymous prompts per class and then applying a per-pixel min-max normalization across all class evidence maps, followed by a direct argmax competition. This formulation will be presented as Equation (3) and Algorithm 1 so that readers can verify the absence of scale-induced bias. revision: yes

  2. Referee: [Table 2] Table 2 (Main Results): the reported gains are described as 'consistent' across eight benchmarks, yet no per-benchmark standard deviations, number of runs, or statistical significance tests are provided; without these, it is difficult to assess whether the improvements are robust or could be explained by prompt-selection variance.

    Authors: We acknowledge the value of statistical reporting. Because CoCo-SAM3 is a deterministic, training-free post-processing step, results are identical across repeated runs for any fixed prompt set; hence standard deviations over random seeds are zero and not reported. The term 'consistent' refers to positive gains on every one of the eight benchmarks under the same protocol. To address prompt-selection variance we will add a short sensitivity study (using three alternative synonymous prompt sets per class) and report the resulting mean and standard deviation in a revised Table 2 or supplementary table. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces CoCo-SAM3 as an inference-only post-processing step on top of SAM3. It identifies issues of mask overlap and prompt inconsistency, then proposes decoupling into intra-class prompt alignment/aggregation followed by inter-class competition on a unified evidence scale. No equations, fitted parameters, or derivations are present that reduce any claimed result to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claim of consistent benchmark gains rests on the logical application of the described procedure rather than any self-referential redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no new mathematical parameters, axioms, or postulated entities; the contribution is a procedural addition to an existing foundation model.

pith-pipeline@v0.9.0 · 5477 in / 1112 out tokens · 47815 ms · 2026-05-10T03:06:11.271759+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 9 canonical work pages · 5 internal anchors

  1. [1]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Barsellotti, L., Amoroso, R., Cornia, M., Baraldi, L., Cucchiara, R.: Training-free open-vocabulary segmentation with offline diffusion-augmented prototype gener- ation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3689–3698 (2024)

  2. [2]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Barsellotti,L.,Bianchi,L.,Messina,N.,Carrara,F.,Cornia,M.,Baraldi,L.,Falchi, F., Cucchiara, R.: Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22025–22035 (2025)

  3. [3]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., et al.: Perception encoder: The best visual em- beddings are not at the output of the network. arXiv preprint arXiv:2504.13181 (2025)

  4. [4]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Bousselham, W., Petersen, F., Ferrari, V., Kuehne, H.: Grounding everything: Emerging localization properties in vision-language transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3828–3837 (2024)

  5. [5]

    Ad- vances in Neural Information Processing Systems32(2019)

    Bucher, M., Vu, T.H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Ad- vances in Neural Information Processing Systems32(2019)

  6. [6]

    In:ProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition

    Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In:ProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition. pp. 1209–1218 (2018)

  7. [7]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)

  8. [8]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

  9. [9]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Cho,S.,Shin,H.,Hong,S.,Arnab,A.,Seo,P.H.,Kim,S.:Cat-seg:Costaggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4113–4123 (2024)

  10. [10]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

  11. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmenta- tion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11583–11592 (2022)

  12. [12]

    Everingham,M.,VanGool,L.,Williams,C.K.,Winn,J.,Zisserman,A.:Thepascal visualobjectclasses(voc)challenge.Internationaljournalofcomputervision88(2), 303–338 (2010)

  13. [13]

    In: European conference on computer vision

    Ghiasi,G.,Gu,X.,Cui,Y.,Lin,T.Y.:Scalingopen-vocabularyimagesegmentation with image-level labels. In: European conference on computer vision. pp. 540–557. Springer (2022)

  14. [14]

    In: Proceedings of the Winter Conference on Applications of Computer Vision

    Hajimiri, S., Ben Ayed, I., Dolz, J.: Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In: Proceedings of the Winter Conference on Applications of Computer Vision. pp. 5061–5071 (2025)

  15. [15]

    In: European Conference on Computer Vision

    Huang, Y., Kang, D., Chen, L., Zhe, X., Jia, W., Bao, L., He, X.: Car: Class-aware regularizations for semantic segmentation. In: European Conference on Computer Vision. pp. 518–534. Springer (2022) 16 Y. Chen et al

  16. [16]

    Jin, S., Yu, S., Zhang, B., Sun, M., Dong, Y., Xiao, J.: Feature purification mat- ters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation.In:ProceedingsoftheIEEE/CVFInternationalConferenceonCom- puter Vision. pp. 20291–20300 (2025)

  17. [17]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Kim, C., Ju, D., Han, W., Yang, M.H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15033–15042 (2025)

  18. [18]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)

  19. [19]

    In: European Conference on Computer Vision

    Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: Clearclip: Decomposing clip representations for dense vision-language inference. In: European Conference on Computer Vision. pp. 143–160. Springer (2024)

  20. [20]

    In: European Conference on Computer Vision

    Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: Proxyclip: Proxy at- tention improves clip for open-vocabulary segmentation. In: European Conference on Computer Vision. pp. 70–88. Springer (2024)

  21. [21]

    Language-driven semantic segmentation,

    Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022)

  22. [22]

    Target refocusing via attention redistribution for open-vocabulary semantic seg- mentation: An explainability perspective.arXiv preprint arXiv:2511.16170, 2025

    Li, J., Lu, Y., Zhang, Y., Xie, Y., Wang, F., Xie, Y., Qu, Y.: Target refocusing via attention redistribution for open-vocabulary semantic segmentation: An explain- ability perspective. arXiv preprint arXiv:2511.16170 (2025)

  23. [23]

    arXiv e-prints pp

    Li, Y., Wang, H., Duan, Y., Li, X.: Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv e-prints pp. arXiv–2304 (2023)

  24. [24]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Mar- culescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 7061–7070 (2023)

  25. [25]

    Chordedit: One-step low-energy transport for image editing.arXiv preprint arXiv:2602.19083, 2026

    Lu,L.,Chen,X.,Guo,M.,Li,S.,Wang,J.,Shi,Y.:Chordedit:One-steplow-energy transport for image editing. arXiv preprint arXiv:2602.19083 (2026)

  26. [26]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 891–898 (2014)

  27. [27]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  28. [28]

    arXiv preprint arXiv:2602.06333 (2026)

    Pei, G., Jiang, X., Yao, Y., Shu, X., Shen, F., Jeon, B.: Taming sam3 in the wild: A concept bank for open-vocabulary segmentation. arXiv preprint arXiv:2602.06333 (2026)

  29. [29]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  30. [30]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

  31. [31]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ru, L., Zheng, H., Zhan, Y., Du, B.: Token contrast for weakly-supervised semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3093–3102 (2023) CoCo-SAM3 for Open-Vocabulary Semantic Segmentation 17

  32. [32]

    In: European Conference on Computer Vision

    Shao, T., Tian, Z., Zhao, H., Su, J.: Explore the potential of clip for training-free open vocabulary semantic segmentation. In: European Conference on Computer Vision. pp. 139–156. Springer (2024)

  33. [33]

    MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

    Shi, Y., Xie, Y., Guo, M., Lu, L., Huang, M., Wang, J., Zhu, Z., Xu, B., Huang, Z.: Mmerror: A benchmark for erroneous reasoning in vision-language models. arXiv preprint arXiv:2601.03331 (2026)

  34. [34]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Shi, Y., Dong, M., Xu, C.: Harnessing vision foundation models for high- performance, training-free open vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23487–23497 (2025)

  35. [35]

    In: European conference on computer vision

    Wang, F., Mei, J., Yuille, A.: Sclip: Rethinking self-attention for dense vision- language inference. In: European conference on computer vision. pp. 315–332. Springer (2024)

  36. [36]

    In: European Conference on Computer Vision

    Wysoczańska, M., Siméoni, O., Ramamonjisoa, M., Bursuc, A., Trzciński, T., Pérez, P.: Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary se- mantic segmentation. In: European Conference on Computer Vision. pp. 320–337. Springer (2024)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xian, Y., Choudhury, S., He, Y., Schiele, B., Akata, Z.: Semantic projection network for zero-and few-label semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8256– 8265 (2019)

  38. [38]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Xie, B., Cao, J., Xie, J., Khan, F.S., Pang, Y.: Sed: A simple encoder-decoder for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3426–3436 (2024)

  39. [39]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Xu, J., De Mello, S., Liu, S., Byeon, W., Breuel, T., Kautz, J., Wang, X.: Groupvit: Semantic segmentation emerges from text supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18134– 18144 (2022)

  40. [40]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Xu, X., Xiong, T., Ding, Z., Tu, Z.: Masqclip for open-vocabulary universal im- age segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 887–898 (2023)

  41. [41]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Xuan, X., Deng, Z., Ma, K.L.: Reme: A data-centric framework for training-free open-vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20954–20965 (2025)

  42. [42]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhang, D., Liu, F., Tang, Q.: Corrclip: Reconstructing patch correlations in clip for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 24677–24687 (2025)

  43. [43]

    Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing throughade20kdataset.In:ProceedingsoftheIEEEconferenceoncomputervision and pattern recognition. pp. 633–641 (2017)

  44. [44]

    In: European conference on computer vision

    Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from clip. In: European conference on computer vision. pp. 696–712. Springer (2022) 18 Y. Chen et al. Appendix A Implementation Details For each category, we construct a prompt set consisting of the canonical class name and several synonymous expressions. The synonym set is used only to build the sem...