
arxiv: 2502.06818 · v3 · submitted 2025-02-05 · 💻 cs.LG

Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

Pith reviewed 2026-05-23 04:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords CLIP · open-vocabulary semantic segmentation · training-free · attention fusion · channel suppression · global context · dense prediction

The pith

Fusing attention from CLIP's global-token blocks with query-query attention and applying channel suppression lets the model keep useful global context for training-free open-vocabulary segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vanilla CLIP patch features encode too much homogeneous image-level information, which limits their use in dense prediction. Earlier training-free methods improve locality by restricting attention to narrow windows but weaken the model's ability to gather global context. GCLIP instead keeps global knowledge by reshaping the final attention and value embeddings: it merges attention maps from earlier global-token blocks into the query-query attention and uses a channel suppression step on the values. The result is features that combine image-level properties with local detail. Experiments across five benchmarks show this approach beats prior methods.

Core claim

GCLIP extracts and utilizes beneficial global knowledge of CLIP for TF-OVSS by fusing the attention from the global-token emerging blocks with the Query-Query attention and designing a novel channel suppression strategy to reshape the last-block attention and Value embeddings so that final patch features aggregate useful global context.

What carries the argument

Fusion of attention from global-token emerging blocks with query-query attention, plus a channel suppression strategy on value embeddings, which adds image-level properties to the last block without creating uniform patch patterns.
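
As a rough illustration of the fusion idea, the sketch below averages attention maps from a few earlier blocks and blends them with a query-query attention map computed from last-block queries. This summary does not give the paper's exact fusion rule, so the plain averaging, the mixing weight `alpha`, the function names, and the toy shapes are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_query_attention(q):
    """Row-stochastic attention built from queries only: softmax(Q Q^T / sqrt(d))."""
    d = q.shape[-1]
    return softmax(q @ q.T / np.sqrt(d), axis=-1)


def fuse_attention(early_maps, qq_attn, alpha=0.5):
    """Blend the mean of earlier-block attention maps with query-query attention.

    `alpha` is a hypothetical mixing weight, not a value taken from the paper.
    A convex combination of row-stochastic matrices stays row-stochastic.
    """
    global_attn = np.mean(early_maps, axis=0)
    return alpha * global_attn + (1.0 - alpha) * qq_attn


rng = np.random.default_rng(0)
num_patches, dim = 6, 8                        # toy sizes, not CLIP's real ones
q_last = rng.normal(size=(num_patches, dim))   # stand-in for last-block queries
values = rng.normal(size=(num_patches, dim))   # stand-in for last-block values
# Stand-ins for attention maps taken from three earlier blocks.
early_maps = np.stack([softmax(rng.normal(size=(num_patches, num_patches)))
                       for _ in range(3)])

fused = fuse_attention(early_maps, query_query_attention(q_last), alpha=0.5)
patch_features = fused @ values                # attention-weighted values
```

Using a convex combination keeps each row of the fused map a valid attention distribution, which is one simple way to add image-level context without discarding the patch-specific query-query term.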

If this is right

  • GCLIP aggregates useful global context into final features while retaining locality benefits.
  • The modified last-block attention and values become more suitable for dense prediction tasks.
  • Performance improves consistently over previous training-free open-vocabulary segmentation methods on five benchmarks.
  • CLIP's original global aggregation ability is preserved rather than sacrificed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion-plus-suppression pattern could be tested on other vision-language models that rely on global tokens.
  • If the channel suppression works by reducing redundant semantic correlations, similar suppression might help in other attention-based dense tasks.
  • The approach suggests that global context can be added back selectively without retraining, which may apply to other dense prediction settings that start from image-level encoders.

Load-bearing premise

That merging early global attention with query-query attention and suppressing channels will give the final attention map image-level properties without creating homogeneous patterns across patches while also making the value embeddings more semantically correlated.
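
The summary does not spell out how channel suppression selects channels. As one hypothetical reading, loosely guided by the Figure 3 observation that a few output channels carry unusually large weight norms, the sketch below flags channels whose norm is far above the median and zeroes them in a toy value matrix. The median-based rule, the threshold `ratio`, the direct mapping from weight channels to value channels, and all names are assumptions for illustration only.

```python
import numpy as np


def outlier_channels(weight, ratio=3.0):
    """Return indices of output channels whose weight norm exceeds ratio * median.

    `weight` has shape (out_channels, in_features). The median-based rule and
    `ratio` are illustrative assumptions, not the paper's criterion.
    """
    norms = np.linalg.norm(weight, axis=1)
    return np.flatnonzero(norms > ratio * np.median(norms))


def suppress_channels(values, channels):
    """Zero the selected channels of the value embeddings (patches x channels)."""
    out = values.copy()          # leave the input untouched
    out[:, channels] = 0.0
    return out


rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 16))   # toy stand-in for an FFN output projection
weight[5] *= 10.0                   # plant one dominant output channel
values = rng.normal(size=(4, 8))    # toy value embeddings: 4 patches, 8 channels

flagged = outlier_channels(weight)
suppressed = suppress_channels(values, flagged)
```

In this toy setup only the planted channel is flagged, and every other channel of the value matrix passes through unchanged.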

What would settle it

Measure whether removing the fusion step or the channel suppression step causes performance to drop below the prior state-of-the-art on any of the five standard benchmarks.

Figures

Figures reproduced from arXiv: 2502.06818 by Cilin Yan, Guoliang Kang, Jingyun Wang.

Figure 1: Experiments with CLIP ViT-B/16. (a) Emergence of global tokens (best viewed in color). Global tokens (highlight stripes in Line 1) start to emerge from the attention map of block 6. Comparing the attention maps after block 6, we observe the attention pattern of global tokens aligns well with that of the [CLS] token (Line 2&3). (b) Channel Suppression (CS). We observe the entropy of weight norms decreases a…
Figure 2: Method Overview. (a) Overview. In this paper, we propose a new framework GCLIP, consisting of Attention Map Fusion (AMF) and Channel Suppression (CS), for Training-Free Open-Vocabulary Semantic Segmentation. (b) Attention Map Fusion. We fuse the attentions of early global-token emerging blocks (L_g, L_{g+1}, ···) with the Query-Query attention of the last block (L_f) to emphasize the effect of global knowled…
Figure 3: Weight Norms of the second fully-connected layer in FFNs. Starting from block 5 (CLIP ViT-B/16), we observe FFN's second fully connected layer weight norm corresponding to a specific output channel becomes unexpectedly larger than the weight norm of other channels. As on different datasets global-token emerging layers may be different, our AMF provides a practical way to automatically identify the global…
Figure 4: Qualitative Results. We visualize the segmentation results of GCLIP on both PASCAL VOC and PASCAL Context. We observe that the masks generated by ClearCLIP usually fail to segment the integral target object because it may confuse semantically similar categories without sufficient global context. GCLIP extracts semantically correlated patch-level image features through enhancing global context information. …
Original abstract

Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to the dense prediction task. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features, by making each patch mainly attend to itself or its neighboring patches within a narrow local window. With their modifications, the ability of CLIP to aggregate global context information is largely weakened. Differently, in this paper, we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is finally determined by the attention weights and the Value embeddings, we propose to reshape the last-block attention and Value embeddings to aggregate useful global context into final features. Firstly, we aim to equip the last-block attention with image-level properties while not introducing homogeneous attention patterns across patches. To realize the goal, we fuse the attention from the global-token emerging blocks with the Query-Query attention. Secondly, we aim to make Value embeddings of the last-block attention module more semantically correlated. To realize this, we design a novel channel suppression strategy. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes GCLIP, a training-free open-vocabulary semantic segmentation method that rethinks CLIP's global knowledge. It argues prior TF-OVSS approaches weaken global context by restricting attention to local windows, and instead fuses attention maps from global-token emerging blocks with Query-Query attention in the last block (to impart image-level properties without homogeneous patch patterns) while applying a novel channel suppression strategy to the Value embeddings (to increase semantic correlation), claiming this aggregates beneficial global context and yields consistent gains over prior SOTA on five standard benchmarks.

Significance. If the central mechanism is validated and the benchmark gains are reproducible, the result would be significant for TF-OVSS: it offers a concrete way to retain CLIP's global aggregation capability while mitigating the homogeneity problem the authors themselves identify, without any training. The design is parameter-free in the sense described and directly targets the attention/Value decomposition of the final features.

major comments (2)
  1. [Abstract] Abstract (and method description): the load-bearing claim that fusing global-token attention with Query-Query attention equips the last-block map with image-level properties 'while not introducing homogeneous attention patterns across patches' is asserted as a design goal but is not accompanied by any attention-map visualizations, homogeneity metrics, or ablation that isolates whether the fusion actually satisfies the 'without introducing homogeneous patterns' constraint; if the fused map remains homogeneous, the claimed aggregation of useful global context does not occur.
  2. [Abstract] Abstract (and method description): the claim that the channel suppression strategy makes the Value embeddings 'more semantically correlated' is central to attributing performance gains to the proposed mechanism, yet no correlation statistics, Value embedding comparisons, or ablation removing the suppression are reported; without such evidence the benchmark improvements cannot be confidently linked to improved Value semantics rather than other implementation details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that additional empirical support will strengthen the manuscript. We will incorporate the requested analyses in the revision.

  1. Referee: [Abstract] Abstract (and method description): the load-bearing claim that fusing global-token attention with Query-Query attention equips the last-block map with image-level properties 'while not introducing homogeneous attention patterns across patches' is asserted as a design goal but is not accompanied by any attention-map visualizations, homogeneity metrics, or ablation that isolates whether the fusion actually satisfies the 'without introducing homogeneous patterns' constraint; if the fused map remains homogeneous, the claimed aggregation of useful global context does not occur.

    Authors: We acknowledge the need for direct evidence to validate that the fusion avoids homogeneous patterns. The method section motivates the fusion of global-token emerging block attention with Query-Query attention precisely to combine image-level context with patch-specific distinctions. In the revised version we will add: (1) side-by-side attention-map visualizations on representative images comparing the fused map against the individual components, (2) a quantitative homogeneity metric (standard deviation of attention weights across patches), and (3) an ablation isolating the fusion step. These additions will confirm that the fused map retains useful global context without the homogeneity issue. revision: yes

  2. Referee: [Abstract] Abstract (and method description): the claim that the channel suppression strategy makes the Value embeddings 'more semantically correlated' is central to attributing performance gains to the proposed mechanism, yet no correlation statistics, Value embedding comparisons, or ablation removing the suppression are reported; without such evidence the benchmark improvements cannot be confidently linked to improved Value semantics rather than other implementation details.

    Authors: We agree that explicit evidence is required to link gains to improved semantic correlation in the Value embeddings. The channel suppression is designed to increase semantic correlation by attenuating less informative channels. In revision we will report: (1) correlation statistics (average pairwise cosine similarity of Value vectors) before versus after suppression, (2) qualitative embedding comparisons, and (3) an ablation with and without suppression to isolate its contribution. This will allow readers to attribute improvements more confidently to the Value modification. revision: yes
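
The two diagnostics named in the responses above (standard deviation of attention weights across patches, and average pairwise cosine similarity of Value vectors) can be sketched as follows. The exact definitions the authors would use are not stated in this summary, so the function names, the choice of population standard deviation, and the toy inputs are assumptions.

```python
import numpy as np


def attention_homogeneity(attn):
    """Mean over keys of the std of attention weights across query patches.

    `attn` has shape (queries, keys). A value of 0.0 means every patch
    attends identically, i.e. the rows are fully homogeneous.
    """
    return float(attn.std(axis=0).mean())


def mean_pairwise_cosine(values):
    """Average cosine similarity over all distinct pairs of value vectors.

    Assumes every row of `values` has nonzero norm.
    """
    unit = values / np.linalg.norm(values, axis=1, keepdims=True)
    sim = unit @ unit.T
    upper = np.triu_indices(len(values), k=1)   # each unordered pair once
    return float(sim[upper].mean())


uniform_attn = np.full((4, 4), 0.25)    # every patch attends the same way
identity_attn = np.eye(4)               # every patch attends only to itself
toy_values = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])

h_uniform = attention_homogeneity(uniform_attn)
h_identity = attention_homogeneity(identity_attn)
cos_toy = mean_pairwise_cosine(toy_values)
```

A fused map that scores near the uniform case would indicate the homogeneity problem the referee raises, while a rise in the cosine statistic after suppression would be consistent with, though not proof of, more semantically correlated Value embeddings.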

Circularity Check

0 steps flagged

No circularity: method is an empirical architectural proposal without derivations or self-referential reductions


The paper proposes GCLIP as a direct design choice—fusing global-token attention with Query-Query attention and applying channel suppression—to reshape last-block attention and Value embeddings for TF-OVSS. No equations, parameter fits, predictions, or derivations appear that reduce by construction to the inputs; the abstract and description frame the approach as a novel proposal justified by benchmark results rather than any self-definitional or fitted-input logic. No load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained as an empirical engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are identifiable or extractable.



