Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation
Pith reviewed 2026-05-23 04:31 UTC · model grok-4.3
The pith
Fusing attention from CLIP's global-token blocks with query-query attention and applying channel suppression lets the model keep useful global context for training-free open-vocabulary segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GCLIP extracts and utilizes beneficial global knowledge of CLIP for TF-OVSS by fusing the attention from the global-token emerging blocks with the Query-Query attention and designing a novel channel suppression strategy to reshape the last-block attention and Value embeddings so that final patch features aggregate useful global context.
What carries the argument
Fusion of attention from global-token emerging blocks with query-query attention, plus a channel suppression strategy on value embeddings, which adds image-level properties to the last block without creating uniform patch patterns.
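The two mechanisms named above can be made concrete with a minimal NumPy sketch. It is not the authors' implementation: the page gives no formulas, so the convex-combination fusion, the weight `alpha`, the rule for choosing which channels to suppress, and every function name below are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_query_attention(q):
    # Patch-to-patch attention built from queries only: softmax(Q Q^T / sqrt(d)).
    d = q.shape[-1]
    return softmax(q @ q.T / np.sqrt(d), axis=-1)

def fuse_attention(qq_attn, global_attn, alpha=0.5):
    # ASSUMPTION: fusion is a convex combination; `alpha` is a hypothetical weight.
    fused = (1.0 - alpha) * qq_attn + alpha * global_attn
    # Renormalise rows to guard against floating-point drift.
    return fused / fused.sum(axis=-1, keepdims=True)

def suppress_channels(v, k=2):
    # ASSUMPTION: zero the k channels with the largest mean absolute activation.
    # The paper's actual selection criterion is not stated on this page.
    strength = np.abs(v).mean(axis=0)
    idx = np.argsort(strength)[-k:]
    out = v.copy()
    out[:, idx] = 0.0
    return out

rng = np.random.default_rng(0)
n_patches, dim = 6, 8
q = rng.standard_normal((n_patches, dim))
v = rng.standard_normal((n_patches, dim))
# Stand-in for an attention map taken from an earlier, global-token emerging block.
global_attn = softmax(rng.standard_normal((n_patches, n_patches)))

fused = fuse_attention(query_query_attention(q), global_attn, alpha=0.3)
v_suppressed = suppress_channels(v, k=2)
features = fused @ v_suppressed  # final patch features = attention x values
```

The sketch only shows where each piece would plug in: the fused map replaces the last-block attention, and the suppressed values replace the original Value embeddings before the final aggregation.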
If this is right
- GCLIP aggregates useful global context into final features while retaining locality benefits.
- The modified last-block attention and values become more suitable for dense prediction tasks.
- Performance improves consistently over previous training-free open-vocabulary segmentation methods on five benchmarks.
- CLIP's original global aggregation ability is preserved rather than sacrificed.
Where Pith is reading between the lines
- The same fusion-plus-suppression pattern could be tested on other vision-language models that rely on global tokens.
- If the channel suppression works by reducing redundant semantic correlations, similar suppression might help in other attention-based dense tasks.
- The approach suggests that global context can be added back selectively without retraining, which may apply to other dense prediction settings that start from image-level encoders.
Load-bearing premise
That merging early global attention with query-query attention and suppressing channels will give the final attention map image-level properties without creating homogeneous patterns across patches while also making the value embeddings more semantically correlated.
What would settle it
Measure whether removing the fusion step or the channel suppression step causes performance to drop below the prior state-of-the-art on any of the five standard benchmarks.
Original abstract
Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to the dense prediction task. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features, by making each patch mainly attend to itself or its neighboring patches within a narrow local window. With their modifications, the ability of CLIP to aggregate global context information is largely weakened. Differently, in this paper, we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is finally determined by the attention weights and the Value embeddings, we propose to reshape the last-block attention and Value embeddings to aggregate useful global context into final features. Firstly, we aim to equip the last-block attention with image-level properties while not introducing homogeneous attention patterns across patches. To realize the goal, we fuse the attention from the global-token emerging blocks with the Query-Query attention. Secondly, we aim to make Value embeddings of the last-block attention module more semantically correlated. To realize this, we design a novel channel suppression strategy. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GCLIP, a training-free open-vocabulary semantic segmentation method that rethinks CLIP's global knowledge. It argues that prior TF-OVSS approaches weaken global context by restricting attention to local windows. Instead, GCLIP fuses attention maps from global-token emerging blocks with Query-Query attention in the last block, to impart image-level properties without homogeneous patch patterns, and applies a novel channel suppression strategy to the Value embeddings, to increase semantic correlation. The authors claim this aggregates beneficial global context and yields consistent gains over prior SOTA on five standard benchmarks.
Significance. If the central mechanism is validated and the benchmark gains are reproducible, the result would be significant for TF-OVSS: it offers a concrete way to retain CLIP's global aggregation capability while mitigating the homogeneity problem the authors themselves identify, without any training. The design is parameter-free in the sense described and directly targets the attention/Value decomposition of the final features.
major comments (2)
- [Abstract and method description] The load-bearing claim that fusing global-token attention with Query-Query attention equips the last-block map with image-level properties 'while not introducing homogeneous attention patterns across patches' is asserted as a design goal but is not accompanied by any attention-map visualizations, homogeneity metrics, or ablation that isolates whether the fusion actually satisfies the 'without introducing homogeneous patterns' constraint; if the fused map remains homogeneous, the claimed aggregation of useful global context does not occur.
- [Abstract and method description] The claim that the channel suppression strategy makes the Value embeddings 'more semantically correlated' is central to attributing performance gains to the proposed mechanism, yet no correlation statistics, Value embedding comparisons, or ablation removing the suppression are reported; without such evidence the benchmark improvements cannot be confidently linked to improved Value semantics rather than other implementation details.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and agree that additional empirical support will strengthen the manuscript. We will incorporate the requested analyses in the revision.
Point-by-point responses
- Referee: [Abstract and method description] The load-bearing claim that fusing global-token attention with Query-Query attention equips the last-block map with image-level properties 'while not introducing homogeneous attention patterns across patches' is asserted as a design goal but is not accompanied by any attention-map visualizations, homogeneity metrics, or ablation that isolates whether the fusion actually satisfies the 'without introducing homogeneous patterns' constraint; if the fused map remains homogeneous, the claimed aggregation of useful global context does not occur.
  Authors: We acknowledge the need for direct evidence to validate that the fusion avoids homogeneous patterns. The method section motivates the fusion of global-token emerging block attention with Query-Query attention precisely to combine image-level context with patch-specific distinctions. In the revised version we will add: (1) side-by-side attention-map visualizations on representative images comparing the fused map against the individual components, (2) a quantitative homogeneity metric (standard deviation of attention weights across patches), and (3) an ablation isolating the fusion step. These additions will confirm that the fused map retains useful global context without the homogeneity issue.
  Revision: yes
- Referee: [Abstract and method description] The claim that the channel suppression strategy makes the Value embeddings 'more semantically correlated' is central to attributing performance gains to the proposed mechanism, yet no correlation statistics, Value embedding comparisons, or ablation removing the suppression are reported; without such evidence the benchmark improvements cannot be confidently linked to improved Value semantics rather than other implementation details.
  Authors: We agree that explicit evidence is required to link gains to improved semantic correlation in the Value embeddings. The channel suppression is designed to increase semantic correlation by attenuating less informative channels. In revision we will report: (1) correlation statistics (average pairwise cosine similarity of Value vectors) before versus after suppression, (2) qualitative embedding comparisons, and (3) an ablation with and without suppression to isolate its contribution. This will allow readers to attribute improvements more confidently to the Value modification.
  Revision: yes
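The two quantitative diagnostics promised in the responses above can be sketched as follows. This is one plausible reading rather than the authors' protocol: the rebuttal does not say along which axis the standard deviation is taken or how pairs are averaged, so both choices, and the function names, are assumptions.

```python
import numpy as np

def attention_homogeneity(attn):
    # ASSUMPTION: for each key position, take the std of its attention weight
    # across query patches, then average over keys. A value near 0 means every
    # patch attends in the same way, i.e. the map is homogeneous.
    return float(attn.std(axis=0).mean())

def mean_pairwise_cosine(v, eps=1e-12):
    # Average cosine similarity over all ordered pairs of distinct Value vectors.
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / np.maximum(norms, eps)
    sim = unit @ unit.T
    n = v.shape[0]
    off_diagonal_sum = sim.sum() - np.trace(sim)
    return float(off_diagonal_sum / (n * (n - 1)))
```

Under this reading, a fused attention map that avoids the homogeneity problem should score clearly above zero on `attention_homogeneity`, and the suppression step would be expected to raise `mean_pairwise_cosine` relative to the unmodified Value embeddings.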
Circularity Check
No circularity: method is an empirical architectural proposal without derivations or self-referential reductions
The paper proposes GCLIP as a direct design choice—fusing global-token attention with Query-Query attention and applying channel suppression—to reshape last-block attention and Value embeddings for TF-OVSS. No equations, parameter fits, predictions, or derivations appear that reduce by construction to the inputs; the abstract and description frame the approach as a novel proposal justified by benchmark results rather than any self-definitional or fitted-input logic. No load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained as an empirical engineering contribution.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we fuse the attention from the global-token emerging blocks with the Query-Query attention... design a novel channel suppression strategy"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "global tokens... encode the image-level properties as [CLS] token"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.