ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
Pith reviewed 2026-05-23 21:39 UTC · model grok-4.3
The pith
CLIP's class-preference and space-preference biases can be explicitly modeled with a reference prompt and positional embeddings, then rectified by subtracting a generated bias logit map to enable better unsupervised semantic segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that by independently encoding class-preference bias into a Reference feature and space-preference bias into a positional feature, generating a bias logit map via their matrix multiplication, subtracting this map from CLIP logits, and processing the result through a mask decoder with contrastive supervision, the bias in CLIP is rectified to produce accurate unsupervised semantic segmentation.
What carries the argument
The bias logit map, created by matrix multiplication of the Reference feature and positional feature, which is subtracted element-wise from CLIP logits to rectify biases before mask decoding.
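A minimal NumPy sketch of this rectification step may help make the shapes concrete. The tensor shapes, the function name `rectify_logits`, and the random inputs are assumptions for illustration only; the review page gives no equations, so this is not the authors' implementation.

```python
import numpy as np

def rectify_logits(clip_logits, reference_feat, positional_feat):
    """Subtract a bias logit map from per-patch CLIP logits.

    Assumed (hypothetical) shapes:
      clip_logits:     (N, C) logits for N patches and C classes
      reference_feat:  (C, D) class-preference feature, one row per class
      positional_feat: (N, D) space-preference feature, one row per patch
    """
    # The matrix multiplication combines both biases into one (N, C) map.
    bias_logit_map = positional_feat @ reference_feat.T
    # Element-wise subtraction rectifies the CLIP logits.
    return clip_logits - bias_logit_map, bias_logit_map

# Toy inputs (random numbers, not real CLIP features).
rng = np.random.default_rng(0)
N, C, D = 6, 3, 4
clip_logits = rng.normal(size=(N, C))
reference_feat = rng.normal(size=(C, D))
positional_feat = rng.normal(size=(N, D))

rectified, bias_map = rectify_logits(clip_logits, reference_feat, positional_feat)
print(rectified.shape, bias_map.shape)
```

Each entry of the map is the dot product of one patch's positional feature with one class's Reference feature, so the two encodings only interact at this multiplication.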
If this is right
- The method achieves favorable performance against prior state-of-the-art on PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff.
- The contrastive loss ensures the bias rectification process is meaningful by aligning visual and text features.
- The mask decoder smooths the rectified logits into contextual segmentation masks.
- Separate encoding of biases avoids interference between class and space preferences.
Where Pith is reading between the lines
- Similar bias modeling might apply to other dense prediction tasks using CLIP-like models.
- Exploring different ways to combine the bias features could yield further improvements.
- The approach suggests that explicit bias correction could enhance CLIP in low-supervision scenarios beyond segmentation.
Load-bearing premise
The assumption that independently encoding the two biases into a Reference feature and a positional feature, then combining them via matrix multiplication, produces a bias logit map whose subtraction from CLIP logits yields a meaningful rectification.
What would settle it
If experiments showed that subtracting the bias logit map fails to improve, or even degrades, segmentation accuracy on the listed benchmarks relative to raw CLIP logits, the rectification approach would be falsified.
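The comparison described here could be run with a mean intersection-over-union (mIoU) metric, sketched below. The helper `mean_iou` and the six-pixel label arrays are hypothetical toy data made up for illustration; they are not results from the paper.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps, skip it
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy labels only: a flattened 6-pixel "image" with 3 classes.
target = np.array([0, 0, 1, 1, 2, 2])
pred_raw = np.array([0, 1, 1, 1, 2, 0])        # stand-in for raw CLIP argmax
pred_rectified = np.array([0, 0, 1, 1, 2, 2])  # stand-in for rectified output

miou_raw = mean_iou(pred_raw, target, num_classes=3)
miou_rectified = mean_iou(pred_rectified, target, num_classes=3)
print(miou_raw, miou_rectified)
```

Under this test, the claim would be in trouble if the rectified mIoU were consistently no higher than the raw mIoU on real benchmark data.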
Original abstract
Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don't explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into different features, i.e., the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at: https://github.com/dogehhh/ReCLIP.
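The Gumbel-Softmax operation mentioned in the abstract can be sketched in NumPy as follows. This shows only the sampling step, not the mask decoder itself; the temperature value, the function name, and the toy logits are assumptions, and a trainable version would need an autodiff framework with a straight-through estimator.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Return a soft sample and its hard one-hot version per row."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via the inverse-CDF trick; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / tau
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the softmax
    soft = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Hard assignment: one-hot at the argmax of the soft sample.
    hard = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]
    return soft, hard

# Toy rectified logits for 5 patches and 3 classes (random, for illustration).
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(5, 3))
soft_mask, hard_mask = gumbel_softmax(toy_logits, tau=0.5, rng=rng)
print(hard_mask)
```

Lower values of `tau` push the soft sample closer to the one-hot assignment, at the cost of noisier gradients in a differentiable setting.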
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that CLIP exhibits class-preference and space-preference biases when applied to unsupervised semantic segmentation, and proposes ReCLIP++ to explicitly model and rectify them: a learnable Reference prompt encodes class bias into a Reference feature, a projection of ViT positional embeddings encodes space bias into a positional feature, their matrix product yields a bias logit map, which is subtracted element-wise from CLIP logits; a mask decoder with Gumbel-Softmax then produces the final mask, trained with a contrastive loss on masked visual and text features. Experiments on PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff report favorable results against prior SOTA, with code released.
Significance. If the bias logit map construction can be shown to isolate the claimed biases rather than serving as an arbitrary learned correction, the explicit rectification approach would strengthen the use of CLIP for dense prediction tasks and provide a reusable template for bias analysis in vision-language models. The multi-benchmark evaluation and open implementation are positive contributions.
major comments (2)
- [abstract / method description] Abstract and method description: the central construction (independent encoding of Reference feature and positional feature, followed by matrix multiplication to produce the bias logit map and element-wise subtraction from CLIP logits) does not enforce that the resulting map isolates class-preference and space-preference biases; nothing in the formulation prevents it from learning an arbitrary adjustment, so performance gains cannot be attributed specifically to bias rectification without additional evidence such as correlation analysis between the subtracted map and observed CLIP biases on uniform regions or class-imbalanced patches.
- [abstract] Abstract: the contrastive loss is stated to make 'the bias modeling and rectification process meaningful and effective,' yet no derivation or ablation demonstrates that the loss directly supervises the bias components rather than the downstream mask decoder; this leaves open whether the rectification step is load-bearing or incidental to the added capacity.
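The correlation analysis requested in the first major comment could take a form like the sketch below: a Pearson correlation between the learned bias logit map and an independently measured bias estimate. How `observed_bias` would be measured (for example, averaging raw CLIP logits over uniform regions) is an assumption, as are the function name and the synthetic inputs.

```python
import numpy as np

def bias_correlation(bias_logit_map, observed_bias):
    """Pearson correlation between two (N, C) maps, flattened."""
    a = np.asarray(bias_logit_map, dtype=float).ravel()
    b = np.asarray(observed_bias, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic maps only: a real study would measure observed_bias from CLIP.
rng = np.random.default_rng(0)
learned_map = rng.normal(size=(6, 3))
observed_bias = 2.0 * learned_map + 1.0   # perfectly correlated by construction
unrelated = rng.normal(size=(6, 3))       # an independent random map

print(bias_correlation(learned_map, observed_bias))
print(bias_correlation(learned_map, unrelated))
```

A correlation near zero on real data would support the referee's concern that the subtracted map is an arbitrary learned adjustment rather than an estimate of CLIP's bias.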
minor comments (2)
- [experiments] The paper should include visualizations of the generated bias logit maps and their alignment with expected class/space biases to support the modeling claim.
- [method] Notation for the Reference prompt parameters and the positional projection should be formalized with equations to clarify independence of the two bias encodings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline planned revisions to strengthen the presentation of the bias rectification approach.
Point-by-point responses
- Referee: [abstract / method description] Abstract and method description: the central construction (independent encoding of Reference feature and positional feature, followed by matrix multiplication to produce the bias logit map and element-wise subtraction from CLIP logits) does not enforce that the resulting map isolates class-preference and space-preference biases; nothing in the formulation prevents it from learning an arbitrary adjustment, so performance gains cannot be attributed specifically to bias rectification without additional evidence such as correlation analysis between the subtracted map and observed CLIP biases on uniform regions or class-imbalanced patches.
  Authors: The architecture explicitly separates the inputs to the two features: the learnable Reference prompt is optimized to capture class-preference bias, while the projection of ViT positional embeddings targets space-preference bias. Only then does their matrix product form the bias logit map that is subtracted from CLIP logits. This separation is intended to model the two biases independently rather than as a generic correction. We acknowledge that the formulation alone does not mathematically guarantee isolation from arbitrary adjustments. To provide the requested evidence, we will add correlation analysis between the subtracted bias maps and CLIP behavior on uniform regions and class-imbalanced patches in the revised manuscript.
  revision: yes
- Referee: [abstract] Abstract: the contrastive loss is stated to make 'the bias modeling and rectification process meaningful and effective,' yet no derivation or ablation demonstrates that the loss directly supervises the bias components rather than the downstream mask decoder; this leaves open whether the rectification step is load-bearing or incidental to the added capacity.
  Authors: The contrastive loss is computed on masked visual features (produced after rectification and Gumbel-Softmax decoding) against class text features, thereby providing end-to-end supervision that includes the rectification step. The current manuscript does not contain a formal derivation or an ablation that isolates supervision specifically on the bias components versus the mask decoder. We will add such an ablation study in the revision to clarify the contribution of the rectification step.
  revision: yes
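A contrastive loss over masked visual features and class text features could look roughly like the InfoNCE-style sketch below. The page does not give the loss formula, so the mask-weighted pooling, the temperature of 0.07, the function name, and the toy features are all assumptions rather than the paper's definition.

```python
import numpy as np

def masked_contrastive_loss(visual_feats, masks, text_feats, temperature=0.07):
    """InfoNCE-style loss between mask-pooled visual features and text features.

    Assumed (hypothetical) shapes:
      visual_feats: (N, D) patch features
      masks:        (N, C) per-class mask weights
      text_feats:   (C, D) one text feature per class
    """
    # Mask-weighted average pooling: (C, N) @ (N, D) -> (C, D).
    weights = masks.T / (masks.sum(axis=0)[:, None] + 1e-8)
    pooled = weights @ visual_feats
    pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
    text = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + 1e-8)
    sim = pooled @ text.T / temperature        # (C, C) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)  # stabilise the log-softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Class c's pooled feature should match class c's text feature.
    return float(-np.mean(np.diag(log_prob)))

# Toy setup: 6 patches, 3 classes, features aligned with one-hot text vectors.
text_feats = np.eye(3)
visual_feats = np.repeat(np.eye(3), 2, axis=0)  # patches 0-1 -> class 0, etc.
masks_good = np.repeat(np.eye(3), 2, axis=0)    # masks agree with the features
masks_bad = np.roll(masks_good, 1, axis=1)      # masks shifted to wrong classes

loss_good = masked_contrastive_loss(visual_feats, masks_good, text_feats)
loss_bad = masked_contrastive_loss(visual_feats, masks_bad, text_feats)
print(loss_good, loss_bad)
```

Because the loss depends on the masks, and the masks depend on the rectified logits, gradients in a differentiable version would reach the bias components only through the decoder, which is the path the referee asks the authors to isolate with an ablation.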
Circularity Check
No circularity: new architecture and loss evaluated externally
Full rationale
The paper introduces a learnable Reference prompt, positional projection, matrix-multiplication bias map, element-wise subtraction, mask decoder with Gumbel-Softmax, and contrastive loss on masked features. These are trained end-to-end and evaluated on held-out benchmarks (PASCAL VOC, ADE20K, etc.). No equation reduces a reported quantity to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise for the rectification step, and the contrastive objective is a standard training signal rather than a tautological renaming of inputs. The derivation chain is therefore self-contained against external data.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable Reference prompt parameters
axioms (1)
- domain assumption: CLIP exhibits class-preference bias and space-preference bias that can be modeled independently without interference
invented entities (1)
- bias logit map: no independent evidence