Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

Cilin Yan; Guoliang Kang; Jingyun Wang

arxiv: 2502.06818 · v3 · submitted 2025-02-05 · 💻 cs.LG

Pith reviewed 2026-05-23 04:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords CLIP · open-vocabulary semantic segmentation · training-free · attention fusion · channel suppression · global context · dense prediction

The pith

Fusing attention from CLIP's global-token blocks with query-query attention and applying channel suppression lets the model keep useful global context for training-free open-vocabulary segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vanilla CLIP patch features encode too much homogeneous image-level information, which limits their use in dense prediction. Earlier training-free methods improve locality by restricting attention to narrow windows but weaken the model's ability to gather global context. GCLIP instead keeps global knowledge by reshaping the final attention and value embeddings: it merges attention maps from earlier global-token blocks into the query-query attention and uses a channel suppression step on the values. The result is features that combine image-level properties with local detail. Experiments across five benchmarks show this approach beats prior methods.

Core claim

GCLIP extracts and utilizes beneficial global knowledge of CLIP for TF-OVSS by fusing the attention from the global-token emerging blocks with the Query-Query attention and designing a novel channel suppression strategy to reshape the last-block attention and Value embeddings so that final patch features aggregate useful global context.

What carries the argument

Fusion of attention from global-token emerging blocks with query-query attention, plus a channel suppression strategy on value embeddings, which adds image-level properties to the last block without creating uniform patch patterns.
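The fusion step described above can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the exact fusion rule is not given on this page, so the averaging over earlier-block maps, the convex mixing with a hypothetical `weight`, and all function names are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def query_query_attention(q):
    """Attention built from queries alone: softmax(q q^T / sqrt(d)), one row per patch."""
    d = q.shape[-1]
    return softmax(q @ q.T / np.sqrt(d), axis=-1)


def fuse_attention(qq_attn, earlier_attns, weight=0.5):
    """Blend the last-block map with the mean of earlier-block maps.

    ASSUMPTION: a convex combination with a hypothetical `weight`;
    the paper's actual fusion rule may differ.
    """
    earlier_mean = np.mean(earlier_attns, axis=0)
    return (1.0 - weight) * qq_attn + weight * earlier_mean


# Toy inputs: random queries and random row-stochastic "earlier block" maps.
rng = np.random.default_rng(0)
n_patches, dim = 6, 8
q_last = rng.normal(size=(n_patches, dim))
earlier = np.stack(
    [softmax(rng.normal(size=(n_patches, n_patches))) for _ in range(3)]
)

qq = query_query_attention(q_last)
fused = fuse_attention(qq, earlier)
print(fused.shape, fused.sum(axis=1))
```

Because every input map is row-stochastic, a convex combination keeps each row summing to one, so the fused map can still be used directly as attention weights. That property is a consequence of the mixing rule assumed here, not something the page states about GCLIP.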

If this is right

GCLIP aggregates useful global context into final features while retaining locality benefits.
The modified last-block attention and values become more suitable for dense prediction tasks.
Performance improves consistently over previous training-free open-vocabulary segmentation methods on five benchmarks.
CLIP's original global aggregation ability is preserved rather than sacrificed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion-plus-suppression pattern could be tested on other vision-language models that rely on global tokens.
If the channel suppression works by reducing redundant semantic correlations, similar suppression might help in other attention-based dense tasks.
The approach suggests that global context can be added back selectively without retraining, which may apply to other dense prediction settings that start from image-level encoders.

Load-bearing premise

That merging early global attention with query-query attention and suppressing channels will give the final attention map image-level properties without creating homogeneous patterns across patches while also making the value embeddings more semantically correlated.

What would settle it

Measure whether removing the fusion step or the channel suppression step causes performance to drop below the prior state-of-the-art on any of the five standard benchmarks.

Figures

Figures reproduced from arXiv: 2502.06818 by Cilin Yan, Guoliang Kang, Jingyun Wang.

**Figure 1.** Experiments with CLIP ViT-B/16. (a) Emergence of global tokens (best viewed in color). Global tokens (highlight stripes in Line 1) start to emerge from the attention map of block 6. Comparing the attention maps after block 6, we observe the attention pattern of global tokens aligns well with that of the [CLS] token (Line 2&3). (b) Channel Suppression (CS). We observe the entropy of weight norms decreases a…

**Figure 2.** Method Overview. (a) Overview. In this paper, we propose a new framework GCLIP, consisting of Attention Map Fusion (AMF) and Channel Suppression (CS), for Training-Free Open-Vocabulary Semantic Segmentation. (b) Attention Map Fusion. We fuse the attentions of early global-token emerging blocks ($L_g, L_{g+1}, \cdots$) with the Query-Query attention of the last block ($L_f$) to emphasize the effect of global knowled…

**Figure 3.** Weight Norms of the second fully-connected layer in FFNs. Starting from block 5 (CLIP ViT-B/16), we observe FFN’s second fully connected layer weight norm corresponding to a specific output channel becomes unexpectedly larger than the weight norm of other channels. As on different datasets global-token emerging layers may be different, our AMF provides a practical way to automatically identify the global…

**Figure 4.** Qualitative Results. We visualize the segmentation results of GCLIP on both PASCAL VOC and PASCAL Context. We observe that the masks generated by ClearCLIP usually fail to segment the integral target object because it may confuse semantically similar categories without sufficient global context. GCLIP extracts semantically correlated patch-level image features through enhancing global context information. …
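The captions for Figures 1(b) and 3 refer to per-channel weight norms of the FFN's second fully-connected layer and to the entropy of those norms. As a rough illustration of how such quantities could be computed, the sketch below uses a synthetic weight matrix. The `(out_channels, in_channels)` layout, the entropy taken over norms normalized to sum to one, and the `mean + k * std` outlier rule are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def channel_weight_norms(w):
    """L2 norm of the weights feeding each output channel.

    ASSUMPTION: `w` has shape (out_channels, in_channels).
    """
    return np.linalg.norm(w, axis=1)


def norm_entropy(norms):
    """Shannon entropy of the norms after normalizing them to sum to one."""
    p = norms / norms.sum()
    return float(-(p * np.log(p)).sum())


def outlier_channels(norms, k=3.0):
    """Indices of channels whose norm exceeds mean + k * std (hypothetical rule)."""
    return np.flatnonzero(norms > norms.mean() + k * norms.std())


# Synthetic stand-in for an FFN weight matrix, plus a copy with one inflated channel.
rng = np.random.default_rng(0)
w_plain = rng.normal(scale=0.02, size=(64, 256))
w_spiky = w_plain.copy()
w_spiky[10] *= 20.0  # one output channel made artificially dominant

norms_plain = channel_weight_norms(w_plain)
norms_spiky = channel_weight_norms(w_spiky)
print(outlier_channels(norms_spiky))
print(norm_entropy(norms_plain), norm_entropy(norms_spiky))
```

In this toy setup a single dominant channel lowers the entropy of the norm distribution, because the mass concentrates on one entry. Whether that matches the trend reported in the truncated Figure 1(b) caption cannot be confirmed from this page.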

Original abstract

Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to the dense prediction task. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features, by making each patch mainly attend to itself or its neighboring patches within a narrow local window. With their modifications, the ability of CLIP to aggregate global context information is largely weakened. Differently, in this paper, we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is finally determined by the attention weights and the Value embeddings, we propose to reshape the last-block attention and Value embeddings to aggregate useful global context into final features. Firstly, we aim to equip the last-block attention with image-level properties while not introducing homogeneous attention patterns across patches. To realize the goal, we fuse the attention from the global-token emerging blocks with the Query-Query attention. Secondly, we aim to make Value embeddings of the last-block attention module more semantically correlated. To realize this, we design a novel channel suppression strategy. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods.
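The abstract notes that each patch representation is determined by the attention weights and the Value embeddings. A minimal sketch of that dependency, with a placeholder suppression step on the Values, might look like the following. The abstract does not say how channels are selected or how strongly they are suppressed, so `channel_ids`, `factor`, and both function names are hypothetical.

```python
import numpy as np


def suppress_channels(values, channel_ids, factor=0.0):
    """Return a copy of `values` (patches x channels) with the chosen channels scaled.

    ASSUMPTION: suppression is modeled as plain scaling by `factor`;
    the paper's channel suppression strategy may work differently.
    """
    out = values.copy()
    out[:, channel_ids] = out[:, channel_ids] * factor
    return out


def aggregate(attn, values):
    """Patch features as attention-weighted sums of Value embeddings."""
    return attn @ values


# Toy inputs: a random row-stochastic attention map and random Values.
rng = np.random.default_rng(0)
n_patches, dim = 4, 5
attn = rng.random((n_patches, n_patches))
attn = attn / attn.sum(axis=1, keepdims=True)
values = rng.normal(size=(n_patches, dim))

suppressed = suppress_channels(values, channel_ids=[2], factor=0.0)
features = aggregate(attn, suppressed)
print(features.shape)
```

Since the output is linear in the Values, anything done to a Value channel carries through to the same channel of every patch feature. This is why reshaping the attention map and reshaping the Values can be treated as two separate levers.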

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GCLIP keeps global context in CLIP for TF-OVSS via attention fusion and channel suppression, but the mechanism lacks direct validation beyond final benchmark scores.

The paper's core move is to stop sacrificing CLIP's global aggregation when adapting it for training-free open-vocabulary segmentation. Prior work made patches attend mostly locally to gain locality, which hurt the model's ability to pull in image-level information. GCLIP instead fuses attention from earlier global-token blocks with the last block's Query-Query attention and adds a channel suppression step on the Value embeddings. This is the concrete proposal, and the abstract says it produces consistent gains over prior methods on five standard benchmarks.

Referee Report

2 major / 0 minor

Summary. The paper proposes GCLIP, a training-free open-vocabulary semantic segmentation method that rethinks CLIP's global knowledge. It argues prior TF-OVSS approaches weaken global context by restricting attention to local windows, and instead fuses attention maps from global-token emerging blocks with Query-Query attention in the last block (to impart image-level properties without homogeneous patch patterns) while applying a novel channel suppression strategy to the Value embeddings (to increase semantic correlation), claiming this aggregates beneficial global context and yields consistent gains over prior SOTA on five standard benchmarks.

Significance. If the central mechanism is validated and the benchmark gains are reproducible, the result would be significant for TF-OVSS: it offers a concrete way to retain CLIP's global aggregation capability while mitigating the homogeneity problem the authors themselves identify, without any training. The design is parameter-free in the sense described and directly targets the attention/Value decomposition of the final features.

major comments (2)

[Abstract and method description] The load-bearing claim is that fusing global-token attention with Query-Query attention equips the last-block map with image-level properties 'while not introducing homogeneous attention patterns across patches'. This is asserted as a design goal, but it is not accompanied by any attention-map visualizations, homogeneity metrics, or ablation that isolates whether the fusion actually satisfies that constraint. If the fused map remains homogeneous, the claimed aggregation of useful global context does not occur.
[Abstract and method description] The claim that the channel suppression strategy makes the Value embeddings 'more semantically correlated' is central to attributing performance gains to the proposed mechanism. Yet no correlation statistics, Value embedding comparisons, or ablation removing the suppression are reported. Without such evidence, the benchmark improvements cannot be confidently linked to improved Value semantics rather than to other implementation details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that additional empirical support will strengthen the manuscript. We will incorporate the requested analyses in the revision.

Referee: [Abstract and method description] The load-bearing claim is that fusing global-token attention with Query-Query attention equips the last-block map with image-level properties 'while not introducing homogeneous attention patterns across patches'. This is asserted as a design goal, but it is not accompanied by any attention-map visualizations, homogeneity metrics, or ablation that isolates whether the fusion actually satisfies that constraint. If the fused map remains homogeneous, the claimed aggregation of useful global context does not occur.

Authors: We acknowledge the need for direct evidence to validate that the fusion avoids homogeneous patterns. The method section motivates the fusion of global-token emerging block attention with Query-Query attention precisely to combine image-level context with patch-specific distinctions. In the revised version we will add: (1) side-by-side attention-map visualizations on representative images comparing the fused map against the individual components, (2) a quantitative homogeneity metric (standard deviation of attention weights across patches), and (3) an ablation isolating the fusion step. These additions are intended to show whether the fused map retains useful global context without the homogeneity issue.
revision: yes

Referee: [Abstract and method description] The claim that the channel suppression strategy makes the Value embeddings 'more semantically correlated' is central to attributing performance gains to the proposed mechanism. Yet no correlation statistics, Value embedding comparisons, or ablation removing the suppression are reported. Without such evidence, the benchmark improvements cannot be confidently linked to improved Value semantics rather than to other implementation details.

Authors: We agree that explicit evidence is required to link gains to improved semantic correlation in the Value embeddings. The channel suppression is designed to increase semantic correlation by attenuating less informative channels. In revision we will report: (1) correlation statistics (average pairwise cosine similarity of Value vectors) before versus after suppression, (2) qualitative embedding comparisons, and (3) an ablation with and without suppression to isolate its contribution. This will allow readers to attribute improvements more confidently to the Value modification.
revision: yes
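The two quantitative checks named in the responses, a standard deviation of attention weights across patches and an average pairwise cosine similarity of Value vectors, could be computed along these lines. This is a sketch under assumptions: the responses do not fix the exact definitions, so the choice of axis for the standard deviation, the exclusion of self-pairs, and both function names are illustrative.

```python
import numpy as np


def attention_homogeneity(attn):
    """Std of attention weights across query patches, averaged over key positions.

    ASSUMPTION: `attn` is (query patches x key patches). A value near zero
    means every patch attends in nearly the same way (a homogeneous map).
    """
    return float(attn.std(axis=0).mean())


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = vectors.shape[0]
    # Subtract the diagonal so self-similarity does not inflate the average.
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


uniform_map = np.full((4, 4), 0.25)  # every patch attends identically
identity_map = np.eye(4)             # every patch attends only to itself

print(attention_homogeneity(uniform_map), attention_homogeneity(identity_map))
print(mean_pairwise_cosine(np.ones((3, 5))), mean_pairwise_cosine(np.eye(3)))
```

The two toy maps mark the extremes: a fully uniform map scores zero on the spread measure, while a purely self-attending map scores well above it. A fused map that keeps patch-specific structure would be expected to land above zero, though where exactly is an empirical question the responses leave to the revision.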

Circularity Check

0 steps flagged

No circularity: method is an empirical architectural proposal without derivations or self-referential reductions

The paper proposes GCLIP as a direct design choice—fusing global-token attention with Query-Query attention and applying channel suppression—to reshape last-block attention and Value embeddings for TF-OVSS. No equations, parameter fits, predictions, or derivations appear that reduce by construction to the inputs; the abstract and description frame the approach as a novel proposal justified by benchmark results rather than any self-definitional or fitted-input logic. No load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained as an empirical engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are identifiable or extractable.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "we fuse the attention from the global-token emerging blocks with the Query-Query attention... design a novel channel suppression strategy"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "global tokens... encode the image-level properties as [CLS] token"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 1 internal anchor

[1]

Grounding everything: Emerging localiza- tion properties in vision-language transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localiza- tion properties in vision-language transformers. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3828–3837, 2024. 3, 6

work page 2024
[2]

Coco- stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1209–1218, 2018. 2, 3, 5

work page 2018
[3]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 6

work page 2021
[4]

Learn- ing to generate text-grounded mask for open-world semantic segmentation from only image-text pairs

Junbum Cha, Jonghwan Mun, and Byungseok Roh. Learn- ing to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11165–11174, 2023. 1, 3, 6

work page 2023
[5]

Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021. 3

work page 2021
[6]

Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolu- tion, and fully connected crfs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolu- tion, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018. 1

work page 2018
[7]

Uniter: Universal image-text representation learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120, 2020. 2

work page 2020
[8]

Per- pixel classification is not all you need for semantic segmen- tation

Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021. 1

work page 2021
[9]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022. 1, 3

work page 2022
[10]

Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation

Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 4113– 4123, 2024. 1, 3

work page 2024
[11]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 2, 5

work page 2016
[12]

Virtex: Learning visual representations from textual annotations

Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11162–11173, 2021. 2

work page 2021
[13]

Learning to prompt for open-vocabulary ob- ject detection with vision-language model

Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary ob- ject detection with vision-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022. 3

work page 2022
[14]

The pascal visual object classes (voc) challenge

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010. 2, 5

work page 2010
[15]

Scal- ing open-vocabulary image segmentation with image-level labels

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022. 1, 3

work page 2022
[16]

Clip-s4: Language-guided self-supervised semantic segmen- tation

Wenbin He, Suphanut Jamonnak, Liang Gou, and Liu Ren. Clip-s4: Language-guided self-supervised semantic segmen- tation. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 11207–11216,

work page
[17]

Open- clip, 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Han- naneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open- clip, 2021. If you use this software, please cite it as below. 7

work page 2021
[18]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR,

work page
[19]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In Proceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 6

work page 2023
[20]

Clearclip: Decom- posing clip representations for dense vision-language infer- ence

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Clearclip: Decom- posing clip representations for dense vision-language infer- ence. arXiv preprint arXiv:2407.12442 , 2024. 1, 2, 3, 6, 7

work page arXiv 2024
[21]

Proxyclip: Proxy at- tention improves clip for open-vocabulary segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy at- tention improves clip for open-vocabulary segmentation. In ECCV, 2024. 6

work page 2024
[22]

Language-driven semantic seg- mentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic seg- mentation. In International Conference on Learning Rep- resentations. 1, 3

work page
[23]

Unicoder-vl: A universal encoder for vision and 9 language by cross-modal pre-training

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. Unicoder-vl: A universal encoder for vision and 9 language by cross-modal pre-training. In Proceedings of the AAAI conference on artificial intelligence, pages 11336– 11344, 2020. 2

work page 2020
[24]

Align before fuse: Vision and language representation learn- ing with momentum distillation

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation. Advances in neural infor- mation processing systems, 34:9694–9705, 2021

work page 2021
[25]

Hero: Hierarchical encoder for video+ lan- guage omni-representation pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+ lan- guage omni-representation pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 2046–2065, 2020. 2

work page 2020
[26]

Clip surgery for better explainability with enhancement in open- vocabulary tasks

Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open- vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023. 1, 3, 6

work page arXiv 2023
[27]

Open-vocabulary semantic segmentation with mask-adapted clip

Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023. 1, 3

work page 2023
[28]

Semantic image segmentation with deep convolutional nets and fully connected crfs

Chen Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, et al. Semantic image segmentation with deep convolutional nets and fully connected crfs. In Inter- national Conference on Learning Representations, 2015. 1

work page 2015
[29]

Fully convolutional networks for semantic segmentation

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Pro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 3431–3440, 2015. 1

work page 2015
[30]

The role of context for object detection and semantic segmentation in the wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 891–898, 2014. 2, 5

work page 2014
[31]

Slip: Self-supervision meets language-image pre- training

Norman Mu, Alexander Kirillov, David Wagner, and Sain- ing Xie. Slip: Self-supervision meets language-image pre- training. In European Conference on Computer Vision, pages 529–544. Springer, 2022. 2

work page 2022
[32]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2, 5, 6, 7

work page 2021
[33]

Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency

Pengzhen Ren, Changlin Li, Hang Xu, Yi Zhu, Guangrun Wang, Jianzhuang Liu, Xiaojun Chang, and Xiaodan Liang. Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency. In The Eleventh In- ternational Conference on Learning Representations. 3

work page
[34]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

work page 2022
[35]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 6

work page 2022
[36]

Ex- plore the potential of clip for training-free open vocabulary semantic segmentation

Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Ex- plore the potential of clip for training-free open vocabulary semantic segmentation. In European Conference on Com- puter Vision, pages 139–156. Springer, 2025. 1, 3, 6

work page 2025
[37]

Yfcc100m: The new data in multimedia research

Bart Thomee, David A Shamma, Gerald Friedland, Ben- jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 3

work page 2016
[38]

Sclip: Rethink- ing self-attention for dense vision-language inference

Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethink- ing self-attention for dense vision-language inference. In European Conference on Computer Vision, pages 315–332. Springer, 2025. 1, 3, 6

work page 2025
[39]

Learn to rectify the bias of clip for unsupervised semantic segmentation

Jingyun Wang and Guoliang Kang. Learn to rectify the bias of clip for unsupervised semantic segmentation. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4102–4112, 2024. 3, 6

work page 2024
[40]

Detecting ev- erything in the open world: Towards universal object detec- tion

Zhenyu Wang, Yali Li, Xi Chen, Ser-Nam Lim, Antonio Tor- ralba, Hengshuang Zhao, and Shengjin Wang. Detecting ev- erything in the open world: Towards universal object detec- tion. arXiv preprint arXiv:2303.11749, 2023. 3

work page arXiv 2023
[41]

Image-text co- decomposition for text-supervised semantic segmentation

Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang, Chun-Pei Chen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Yung-Yu Chuang, and Yen-Yu Lin. Image-text co- decomposition for text-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 26794–26803, 2024. 3

work page 2024
[42]

Rewrite caption semantics: Bridging seman- tic gaps for language-supervised semantic segmentation.Ad- vances in Neural Information Processing Systems, 36, 2024

Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Ling Shao, and Shijian Lu. Rewrite caption semantics: Bridging seman- tic gaps for language-supervised semantic segmentation.Ad- vances in Neural Information Processing Systems, 36, 2024. 1, 3, 6

work page 2024
[43]

Demystifying CLIP Data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystify- ing clip data. arXiv preprint arXiv:2309.16671, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Groupvit: Semantic segmentation emerges from text supervision

Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 18134–18144, 2022. 1, 3, 6

work page 2022
[45]

Open-vocabulary panop- tic segmentation with text-to-image diffusion models

Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiao- long Wang, and Shalini De Mello. Open-vocabulary panop- tic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2955–2966, 2023. 3

work page 2023
[46]

A simple framework for text- supervised semantic segmentation

Muyang Yi, Quan Cui, Hao Wu, Cheng Yang, Osamu Yoshie, and Hongtao Lu. A simple framework for text- supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7071–7080, 2023. 1

work page 2023
[47]

Pyramid scene parsing network

Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In 10 Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 1

work page 2017
[48]

B. Zhou, Z. Hang, Francesco Xavier Puig Fernandez, S. Fi- dler, and A. Torralba. Scene parsing through ade20k dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 5

[49]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 696–712. Springer, 2022.
