ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Guoliang Kang; Jingyun Wang

arxiv: 2408.06747 · v4 · submitted 2024-08-13 · 💻 cs.CV

Pith reviewed 2026-05-23 21:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords unsupervised semantic segmentation · CLIP · bias rectification · reference prompt · positional embedding · contrastive learning · mask decoder · Gumbel-Softmax

The pith

CLIP's class-preference and space-preference biases can be explicitly modeled with a reference prompt and positional embeddings, then rectified by subtracting a generated bias logit map to enable better unsupervised semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that biases in CLIP limit its use for unsupervised semantic segmentation and shows how to model and correct them. Class-preference bias is captured by a learnable Reference prompt, while space-preference bias comes from projecting the vision transformer's positional embedding. These are kept separate as features, multiplied to form a bias logit map, and subtracted from CLIP's output logits. A mask decoder then produces refined masks using Gumbel-Softmax, guided by a contrastive loss on masked features and class text features. This leads to stronger results across standard segmentation benchmarks without any annotations.

Core claim

The paper claims that by independently encoding class-preference bias into a Reference feature and space-preference bias into a positional feature, generating a bias logit map via their matrix multiplication, subtracting this map from CLIP logits, and processing the result through a mask decoder with contrastive supervision, the bias in CLIP is rectified to produce accurate unsupervised semantic segmentation.

What carries the argument

The bias logit map, created by matrix multiplication of the Reference feature and positional feature, which is subtracted element-wise from CLIP logits to rectify biases before mask decoding.
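
The construction described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the shapes (N patches, K classes, feature dimension D), the function names, and the toy numbers are all assumptions made for this sketch.

```python
# Illustrative sketch only: shapes, names, and numbers are assumptions,
# not taken from the paper's released code.

def matmul_transposed(a, b):
    """Return a @ b^T for nested lists a (N x D) and b (K x D)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def rectify_logits(clip_logits, positional_feature, reference_feature):
    """Build the bias logit map and subtract it element-wise from CLIP logits.

    clip_logits:        N x K (patches x classes)
    positional_feature: N x D (assumed to carry space-preference bias)
    reference_feature:  K x D (assumed to carry class-preference bias)
    """
    bias_map = matmul_transposed(positional_feature, reference_feature)  # N x K
    rectified = [
        [logit - bias for logit, bias in zip(logit_row, bias_row)]
        for logit_row, bias_row in zip(clip_logits, bias_map)
    ]
    return bias_map, rectified


# Toy example: 2 patches, 3 classes, feature dimension 2.
positional = [[1.0, 0.0], [0.0, 1.0]]
reference = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
logits = [[2.0, 1.0, 3.0], [1.0, 2.0, 3.0]]
bias_map, rectified = rectify_logits(logits, positional, reference)
```

In this toy case the third class receives the largest bias at both patches, so subtraction lowers its logit the most. Whether a learned map behaves this way in practice is exactly what the review below questions.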

If this is right

The method achieves favorable performance against prior state-of-the-art on PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff.
The contrastive loss ensures the bias rectification process is meaningful by aligning visual and text features.
The mask decoder smooths the rectified logits into contextual segmentation masks.
Separate encoding of biases avoids interference between class and space preferences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar bias modeling might apply to other dense prediction tasks using CLIP-like models.
Exploring different ways to combine the bias features could yield further improvements.
The approach suggests that explicit bias correction could enhance CLIP in low-supervision scenarios beyond segmentation.

Load-bearing premise

The assumption that independently encoding the two biases into a Reference feature and a positional feature, then combining them via matrix multiplication, produces a bias logit map whose subtraction from CLIP logits yields a meaningful rectification.

What would settle it

If experiments show that subtracting the bias logit map fails to improve, or even degrades, segmentation accuracy on the listed benchmarks relative to using raw CLIP logits, the rectification approach would be falsified.
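
A check of this kind amounts to comparing mean intersection-over-union (mIoU) for predictions taken from raw logits against predictions taken from rectified logits. The sketch below shows the comparison on made-up toy numbers; the helper names, the logits, and the labels are hypothetical and serve only to illustrate the metric.

```python
# Illustrative sketch: toy logits and labels are invented for this example.

def argmax_labels(logits):
    """Pick the highest-scoring class index for each patch (rows of logits)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


target = [0, 1, 1, 2]
# Raw logits in which class 2 is over-scored everywhere (a class-preference bias).
raw_logits = [[1.0, 0.5, 1.2], [0.2, 1.0, 1.1], [0.1, 0.9, 0.3], [0.0, 0.2, 1.5]]
# The same logits after a hypothetical bias of 0.3 is removed from class 2.
rectified_logits = [[1.0, 0.5, 0.9], [0.2, 1.0, 0.8], [0.1, 0.9, 0.0], [0.0, 0.2, 1.2]]

raw_miou = mean_iou(argmax_labels(raw_logits), target, num_classes=3)
rectified_miou = mean_iou(argmax_labels(rectified_logits), target, num_classes=3)
```

If `rectified_miou` did not exceed `raw_miou` on real benchmarks, the falsification condition above would be met.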

Original abstract

Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don't explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into different features, i.e., the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at: https://github.com/dogehhh/ReCLIP.
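
The abstract names the Gumbel-Softmax operation without spelling it out. The sketch below shows the standard Gumbel-Softmax relaxation on a single vector of class logits, as background only; the function name, the temperature value, and the per-vector usage are assumptions, and the paper's mask decoder may apply the operation differently.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Sample a relaxed one-hot vector: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Seeded generator so the sample is reproducible.
sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
```

Lower values of `tau` push the sample closer to a one-hot vector, while higher values make it closer to uniform.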

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds an explicit bias subtraction step using a reference prompt and positional features, but the evidence that this isolates the claimed biases is thin.

The paper's main move is to build a bias logit map from a learnable Reference prompt (for class-preference bias) and a projection of ViT positional embeddings (for space-preference bias), multiply the two features, subtract the result from CLIP logits, then feed the rectified logits plus CLIP features into a Gumbel-Softmax mask decoder with an added contrastive loss on masked visual features versus text embeddings. This construction is not in the earlier CLIP segmentation papers cited in the abstract. The method reports better numbers than prior work on PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff, and the code is released. That combination of a concrete new mechanism and competitive benchmarks is the useful part.

The soft spot is exactly where the stress-test note points: the matrix multiplication is presented as producing a map that specifically represents the two biases, yet nothing in the construction or the reported results enforces or verifies that alignment. Without ablations on the bias map itself, analysis of what the map actually captures in uniform regions, or checks that subtraction removes bias rather than signal, the performance lift could come from the extra parameters and decoder rather than principled rectification. The contrastive loss makes the overall loop trainable but does not prove the intermediate map is meaningful.

This is for groups already working on CLIP adaptations for dense prediction or unsupervised segmentation. It deserves peer review because the design is straightforward to reproduce and the empirical claims are testable, even if the bias-isolation claim needs more support.

Referee Report

2 major / 2 minor

Summary. The paper claims that CLIP exhibits class-preference and space-preference biases when applied to unsupervised semantic segmentation, and proposes ReCLIP++ to explicitly model and rectify them: a learnable Reference prompt encodes class bias into a Reference feature, a projection of ViT positional embeddings encodes space bias into a positional feature, their matrix product yields a bias logit map, which is subtracted element-wise from CLIP logits; a mask decoder with Gumbel-Softmax then produces the final mask, trained with a contrastive loss on masked visual and text features. Experiments on PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff report favorable results against prior SOTA, with code released.

Significance. If the bias logit map construction can be shown to isolate the claimed biases rather than serving as an arbitrary learned correction, the explicit rectification approach would strengthen the use of CLIP for dense prediction tasks and provide a reusable template for bias analysis in vision-language models. The multi-benchmark evaluation and open implementation are positive contributions.

major comments (2)

[abstract / method description] Abstract and method description: the central construction (independent encoding of Reference feature and positional feature, followed by matrix multiplication to produce the bias logit map and element-wise subtraction from CLIP logits) does not enforce that the resulting map isolates class-preference and space-preference biases; nothing in the formulation prevents it from learning an arbitrary adjustment, so performance gains cannot be attributed specifically to bias rectification without additional evidence such as correlation analysis between the subtracted map and observed CLIP biases on uniform regions or class-imbalanced patches.
[abstract] Abstract: the contrastive loss is stated to make 'the bias modeling and rectification process meaningful and effective,' yet no derivation or ablation demonstrates that the loss directly supervises the bias components rather than the downstream mask decoder; this leaves open whether the rectification step is load-bearing or incidental to the added capacity.
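
The correlation analysis requested in the first major comment could, under simple assumptions, be a Pearson correlation between the subtracted bias map and an empirical estimate of CLIP's bias. The sketch below assumes both are available as N x K matrices; the "observed bias" values are hypothetical stand-ins (for example, average CLIP logits on uniform-region inputs), not measurements from the paper.

```python
import math


def flatten(matrix):
    """Flatten a nested list (rows of numbers) into a single list."""
    return [value for row in matrix for value in row]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy matrices (2 patches x 3 classes).
learned_bias_map = [[0.5, 0.0, 1.0], [0.0, 0.5, 1.0]]
observed_bias = [[0.6, 0.1, 1.1], [0.1, 0.4, 0.9]]
r = pearson(flatten(learned_bias_map), flatten(observed_bias))
```

A value of `r` near 1 would support the claim that the map tracks observed bias; a value near 0 would suggest an arbitrary learned correction.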

minor comments (2)

[experiments] The paper should include visualizations of the generated bias logit maps and their alignment with expected class/space biases to support the modeling claim.
[method] Notation for the Reference prompt parameters and the positional projection should be formalized with equations to clarify independence of the two bias encodings.
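
A formalization of the kind requested in the second minor comment might look like the following. Every symbol here is an assumption introduced for illustration (N patches, K classes, feature dimension D, and unspecified encoders f and g); the paper's own notation may differ.

```latex
% Hypothetical notation, not taken from the paper.
% p_ref: learnable Reference prompt; E_pos: ViT positional embedding;
% L: CLIP logits; B: bias logit map; \hat{L}: rectified logits.
\begin{align}
  F_{\mathrm{ref}} &= f(p_{\mathrm{ref}}) \in \mathbb{R}^{K \times D}, \\
  F_{\mathrm{pos}} &= g(E_{\mathrm{pos}}) \in \mathbb{R}^{N \times D}, \\
  B &= F_{\mathrm{pos}} F_{\mathrm{ref}}^{\top} \in \mathbb{R}^{N \times K}, \\
  \hat{L} &= L - B, \qquad L \in \mathbb{R}^{N \times K}.
\end{align}
```

Written this way, the independence of the two encodings is visible in that the first feature depends only on the prompt and the second only on the positional embedding; the two interact solely through the product B.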

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline planned revisions to strengthen the presentation of the bias rectification approach.

Referee: [abstract / method description] Abstract and method description: the central construction (independent encoding of Reference feature and positional feature, followed by matrix multiplication to produce the bias logit map and element-wise subtraction from CLIP logits) does not enforce that the resulting map isolates class-preference and space-preference biases; nothing in the formulation prevents it from learning an arbitrary adjustment, so performance gains cannot be attributed specifically to bias rectification without additional evidence such as correlation analysis between the subtracted map and observed CLIP biases on uniform regions or class-imbalanced patches.

Authors: The architecture explicitly separates the inputs to the two features (the learnable Reference prompt is optimized to capture class-preference bias, while the projection of ViT positional embeddings targets space-preference bias) before their matrix product forms the bias logit map that is subtracted from CLIP logits. This separation is intended to model the two biases independently rather than as a generic correction. We acknowledge that the formulation alone does not mathematically guarantee isolation from arbitrary adjustments. To provide the requested evidence, we will add correlation analysis between the subtracted bias maps and CLIP behavior on uniform regions and class-imbalanced patches in the revised manuscript. revision: yes
Referee: [abstract] Abstract: the contrastive loss is stated to make 'the bias modeling and rectification process meaningful and effective,' yet no derivation or ablation demonstrates that the loss directly supervises the bias components rather than the downstream mask decoder; this leaves open whether the rectification step is load-bearing or incidental to the added capacity.

Authors: The contrastive loss is computed on masked visual features (produced after rectification and Gumbel-Softmax decoding) against class text features, thereby providing end-to-end supervision that includes the rectification step. The current manuscript does not contain a formal derivation or an ablation that isolates supervision specifically on the bias components versus the mask decoder. We will add such an ablation study in the revision to clarify the contribution of the rectification step. revision: yes
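
To make the supervision path discussed in the second response concrete, the sketch below computes an InfoNCE-style loss between mask-pooled visual features and class text features. The exact loss in the paper is not given in this text, so the pooling rule, the cosine similarity, the temperature, and all names are assumptions chosen for illustration.

```python
import math


def masked_pool(features, mask_column):
    """Average patch features (N x D) weighted by one class's soft mask (length N)."""
    weight = sum(mask_column)
    dim = len(features[0])
    return [sum(m * f[d] for m, f in zip(mask_column, features)) / weight for d in range(dim)]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def contrastive_loss(features, masks, text_features, temperature=0.1):
    """Cross-entropy of each class's pooled feature against all class text features.

    features: N x D, masks: N x K (soft masks), text_features: K x D.
    """
    losses = []
    for k in range(len(text_features)):
        column = [row[k] for row in masks]
        if sum(column) <= 0:
            continue  # skip classes with no mask mass
        pooled = masked_pool(features, column)
        sims = [cosine(pooled, t) / temperature for t in text_features]
        peak = max(sims)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
        losses.append(log_denominator - sims[k])
    return sum(losses) / len(losses) if losses else 0.0


# Toy check: masks aligned with the text features give a lower loss than swapped masks.
features = [[1.0, 0.0], [0.0, 1.0]]
text_features = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(features, [[1.0, 0.0], [0.0, 1.0]], text_features)
swapped_loss = contrastive_loss(features, [[0.0, 1.0], [1.0, 0.0]], text_features)
```

Because the loss depends on the masks, and the masks depend on the rectified logits, gradients can in principle reach the bias components. As the referee notes, that alone does not show the rectification step is what drives the gains.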

Circularity Check

0 steps flagged

No circularity: new architecture and loss evaluated externally

The paper introduces a learnable Reference prompt, positional projection, matrix-multiplication bias map, element-wise subtraction, mask decoder with Gumbel-Softmax, and contrastive loss on masked features. These are trained end-to-end and evaluated on held-out benchmarks (PASCAL VOC, ADE20K, etc.). No equation reduces a reported quantity to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise for the rectification step, and the contrastive objective is a standard training signal rather than a tautological renaming of inputs. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that CLIP exhibits separable class and space biases that can be captured by the proposed reference and positional features; the method introduces one learnable component (the reference prompt) and one derived representation (the bias logit map) whose independent evidence is limited to the training objective itself.

free parameters (1)

learnable Reference prompt parameters
The reference prompt is optimized during training to encode class-preference bias; its weights are not fixed a priori.

axioms (1)

domain assumption: CLIP exhibits class-preference bias and space-preference bias that can be modeled independently without interference
Invoked when the authors state that the two biases are encoded separately to avoid interference before matrix multiplication.

invented entities (1)

bias logit map (no independent evidence)
purpose: Explicit representation of combined class and space biases obtained via matrix multiplication of reference and positional features
New derived quantity introduced to enable the rectification step; no external falsifiable prediction is supplied.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Pixel-level cycle association: A new perspective for domain adaptive seman- tic segmentation,

G. Kang, Y . Wei, Y . Yang, Y . Zhuang, and A. Hauptmann, “Pixel-level cycle association: A new perspective for domain adaptive seman- tic segmentation,” Advances in neural informa- tion processing systems, vol. 33, pp. 3569–3580, 2020

work page 2020
[2]

Contrastive adaptation net- work for single-and multi-source domain adap- tation,

G. Kang, L. Jiang, Y . Wei, Y . Yang, and A. Hauptmann, “Contrastive adaptation net- work for single-and multi-source domain adap- tation,” IEEE transactions on pattern analysis and machine intelligence , vol. 44, no. 4, pp. 1793–1804, 2020

work page 2020
[3]

Llmformer: large language model for open-vocabulary semantic segmentation,

H. Shi, S. D. Dao, and J. Cai, “Llmformer: large language model for open-vocabulary semantic segmentation,” International Journal of Com- puter Vision, pp. 1–18, 2024

work page 2024
[4]

Multi-modal prototypes for open-world semantic segmentation,

Y . Yang, C. Ma, C. Ju, F. Zhang, J. Yao, Y . Zhang, and Y . Wang, “Multi-modal prototypes for open-world semantic segmentation,” Inter- national Journal of Computer Vision , pp. 1–17, 2024

work page 2024
[5]

Ov-vis: Open-vocabulary video instance segmentation,

H. Wang, C. Yan, K. Chen, X. Jiang, X. Tang, Y . Hu, G. Kang, W. Xie, and E. Gavves, “Ov-vis: Open-vocabulary video instance segmentation,” International Journal of Computer Vision, pp. 1– 18, 2024

work page 2024
[6]

Pyramid scene parsing network,

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceed- ings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890

work page 2017
[7]

Eca-net: Efficient channel attention for deep convolutional neural networks,

Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in Pro- ceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , 2020, pp. 11 534–11 542

work page 2020
[8]

Per- pixel classification is not all you need for seman- tic segmentation,

B. Cheng, A. Schwing, and A. Kirillov, “Per- pixel classification is not all you need for seman- tic segmentation,” Advances in Neural Informa- tion Processing Systems , vol. 34, pp. 17 864– 17 875, 2021

work page 2021
[9]

Masked-attention mask 13 transformer for universal image segmentation,

B. Cheng, I. Misra, A. G. Schwing, A. Kir- illov, and R. Girdhar, “Masked-attention mask 13 transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299

work page 2022
[10]

Invariant information clustering for unsupervised image classification and segmentation,

X. Ji, J. F. Henriques, and A. Vedaldi, “Invariant information clustering for unsupervised image classification and segmentation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9865–9874

work page 2019
[11]

Picie: Unsupervised semantic segmenta- tion using invariance and equivariance in cluster- ing,

J. H. Cho, U. Mall, K. Bala, and B. Hariha- ran, “Picie: Unsupervised semantic segmenta- tion using invariance and equivariance in cluster- ing,” in Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recogni- tion, 2021, pp. 16 794–16 804

work page 2021
[12]

Unsupervised hierarchical seman- tic segmentation with multiview cosegmentation and clustering transformers,

T.-W. Ke, J.-J. Hwang, Y . Guo, X. Wang, and S. X. Yu, “Unsupervised hierarchical seman- tic segmentation with multiview cosegmentation and clustering transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2571–2581

work page 2022
[13]

Deep spectral methods: A surpris- ingly strong baseline for unsupervised seman- tic segmentation and localization,

L. Melas-Kyriazi, C. Rupprecht, I. Laina, and A. Vedaldi, “Deep spectral methods: A surpris- ingly strong baseline for unsupervised seman- tic segmentation and localization,” in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8364– 8375

work page 2022
[14]

Autore- gressive unsupervised image segmentation,

Y . Ouali, C. Hudelot, and M. Tami, “Autore- gressive unsupervised image segmentation,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16. Springer, 2020, pp. 142–158

work page 2020
[15]

Acseg: Adaptive conceptualization for unsuper- vised semantic segmentation,

K. Li, Z. Wang, Z. Cheng, R. Yu, Y . Zhao, G. Song, C. Liu, L. Yuan, and J. Chen, “Acseg: Adaptive conceptualization for unsuper- vised semantic segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7162–7172

work page 2023
[16]

Unsupervised seman- tic segmentation by contrasting object mask pro- posals,

W. Van Gansbeke, S. Vandenhende, S. Geor- goulis, and L. Van Gool, “Unsupervised seman- tic segmentation by contrasting object mask pro- posals,” in Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, 2021, pp. 10 052–10 062

work page 2021
[17]

Momentum contrast for unsupervised visual representation learning,

K. He, H. Fan, Y . Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

work page 2020
[18]

Segsort: Segmentation by discriminative sorting of seg- ments,

J.-J. Hwang, S. X. Yu, J. Shi, M. D. Collins, T.- J. Yang, X. Zhang, and L.-C. Chen, “Segsort: Segmentation by discriminative sorting of seg- ments,” in Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, 2019, pp. 7334–7344

work page 2019
[19]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervi- sion,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

work page 2021
[20]

Extract free dense labels from clip,

C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in Computer Vision– ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII. Springer, 2022, pp. 696–712

work page 2022
[21]

Reco: Retrieve and co-segment for zero-shot transfer,

G. Shin, W. Xie, and S. Albanie, “Reco: Retrieve and co-segment for zero-shot transfer,” Advances in Neural Information Processing Sys- tems, vol. 35, pp. 33 754–33 767, 2022

work page 2022
[22]

Perceptual grouping in contrastive vision-language models,

K. Ranasinghe, B. McKinzie, S. Ravi, Y . Yang, A. Toshev, and J. Shlens, “Perceptual grouping in contrastive vision-language models,” in Pro- ceedings of the IEEE/CVF International Confer- ence on Computer Vision, 2023, pp. 5571–5584

work page 2023
[23]

Clip- s4: Language-guided self-supervised semantic segmentation,

W. He, S. Jamonnak, L. Gou, and L. Ren, “Clip- s4: Language-guided self-supervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 207–11 216

work page 2023
[24]

Learn to rectify the bias of clip for unsupervised semantic segmentation,

J. Wang and G. Kang, “Learn to rectify the bias of clip for unsupervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4102–4112. 14

work page 2024
[25]

The pascal visual object classes (voc) challenge,

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, pp. 303–338, 2010

work page 2010
[26]

The role of context for object detection and semantic segmentation in the wild,

R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 891–898

work page 2014
[27]

Scene parsing through ade20k dataset,

B. Zhou, Z. Hang, F. X. P. Fernandez, S. Fidler, and A. Torralba, “Scene parsing through ade20k dataset,” in2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

work page 2017
[28]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213– 3223

work page 2016
[29]

Coco- stuff: Thing and stuff classes in context,

H. Caesar, J. Uijlings, and V . Ferrari, “Coco- stuff: Thing and stuff classes in context,” in Proceedings of the IEEE conference on com- puter vision and pattern recognition , 2018, pp. 1209–1218

work page 2018
[30]

Uniter: Universal image-text representation learning,

Y .-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y . Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” inEuropean Conference on Computer Vision, 2020, pp. 104– 120

work page 2020
[31]

Virtex: Learning visual representations from textual annotations,

K. Desai and J. Johnson, “Virtex: Learning visual representations from textual annotations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2021, pp. 11 162–11 173

work page 2021
[32]

Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training,

G. Li, N. Duan, Y . Fang, M. Gong, and D. Jiang, “Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training,” in Pro- ceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 336– 11 344

work page 2020
[33]

Align before fuse: Vision and language representation learn- ing with momentum distillation,

J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learn- ing with momentum distillation,” Advances in neural information processing systems , vol. 34, pp. 9694–9705, 2021

work page 2021
[34]

Hero: Hierarchical encoder for video+ language omni-representation pre-training,

L. Li, Y .-C. Chen, Y . Cheng, Z. Gan, L. Yu, and J. Liu, “Hero: Hierarchical encoder for video+ language omni-representation pre-training,” in Proceedings of the 2020 Conference on Empir- ical Methods in Natural Language Processing (EMNLP), 2020, pp. 2046–2065

work page 2020
[35]

Scaling up visual and vision- language representation learning with noisy text supervision,

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision- language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904– 4916

work page 2021
[36]

Slip: Self-supervision meets language-image pre-training,

N. Mu, A. Kirillov, D. Wagner, and S. Xie, “Slip: Self-supervision meets language-image pre-training,” in European Conference on Com- puter Vision. Springer, 2022, pp. 529–544

work page 2022
[37]

Learning to prompt for open-vocabulary object detection with vision-language model,

Y . Du, F. Wei, Z. Zhang, M. Shi, Y . Gao, and G. Li, “Learning to prompt for open-vocabulary object detection with vision-language model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 084–14 093

work page 2022
[38]

Detecting everything in the open world: Towards universal object detec- tion,

Z. Wang, Y . Li, X. Chen, S.-N. Lim, A. Torralba, H. Zhao, and S. Wang, “Detecting everything in the open world: Towards universal object detec- tion,”arXiv preprint arXiv:2303.11749, 2023

work page arXiv 2023
[39]

Learning to gener- ate text-grounded mask for open-world semantic segmentation from only image-text pairs,

J. Cha, J. Mun, and B. Roh, “Learning to gener- ate text-grounded mask for open-world semantic segmentation from only image-text pairs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 165–11 174

work page 2023
[40]

Side adapter network for open-vocabulary semantic segmentation,

M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, “Side adapter network for open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2945–2954. 15

work page 2023
[41]

Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation,

J. Ahn and S. Kwak, “Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation,” in Proceedings of the IEEE conference on com- puter vision and pattern recognition , 2018, pp. 4981–4990

work page 2018
[42]

Weakly supervised semantic segmentation by pixel- to-prototype contrast,

Y . Du, Z. Fu, Q. Liu, and Y . Wang, “Weakly supervised semantic segmentation by pixel- to-prototype contrast,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4320–4329

work page 2022
[43]

From image- level to pixel-level labeling with convolutional networks,

P. O. Pinheiro and R. Collobert, “From image- level to pixel-level labeling with convolutional networks,” in Proceedings of the IEEE confer- ence on computer vision and pattern recognition, 2015, pp. 1713–1721

work page 2015
[44]

C2am: contrastive learning of class- agnostic activation map for weakly supervised object localization and semantic segmentation,

J. Xie, J. Xiang, J. Chen, X. Hou, X. Zhao, and L. Shen, “C2am: contrastive learning of class- agnostic activation map for weakly supervised object localization and semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 989–998

work page 2022
[45]

Scan: Learning to classify images without labels,

W. Van Gansbeke, S. Vandenhende, S. Geor- goulis, M. Proesmans, and L. Van Gool, “Scan: Learning to classify images without labels,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X. Springer, 2020, pp. 268– 285

work page 2020
[46]

Hamilton, Z

M. Hamilton, Z. Zhang, B. Hariharan, N. Snavely, and W. T. Freeman, in The Tenth International Conference on Learning Rep- resentations, ICLR 2022, Virtual Event, April 25-29, 2022

work page 2022
[47]

Emergence of object segmentation in perturbed generative models,

A. Bielski and P. Favaro, “Emergence of object segmentation in perturbed generative models,” Advances in Neural Information Processing Sys- tems, vol. 32, 2019

work page 2019
[48]

Unsu- pervised object segmentation by redrawing,

M. Chen, T. Arti `eres, and L. Denoyer, “Unsu- pervised object segmentation by redrawing,” Advances in neural information processing sys- tems, vol. 32, 2019

work page 2019
[49]

Gener- ative adversarial networks: An overview,

A. Creswell, T. White, V . Dumoulin, K. Arulku- maran, B. Sengupta, and A. A. Bharath, “Gener- ative adversarial networks: An overview,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 53–65, 2018

work page 2018
[50]

Learning neural eigenfunc- tions for unsupervised semantic segmentation,

Z. Deng and Y . Luo, “Learning neural eigenfunc- tions for unsupervised semantic segmentation,” arXiv preprint arXiv:2304.02841, 2023

work page arXiv 2023
[51]

The hungarian method for the assignment problem,

H. W. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955

work page 1955
[52]

Emerg- ing properties in self-supervised vision trans- formers,

M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerg- ing properties in self-supervised vision trans- formers,” inProceedings of the IEEE/CVF inter- national conference on computer vision , 2021, pp. 9650–9660

work page 2021
[53]

Zero-shot semantic segmentation,

M. Bucher, T.-H. Vu, M. Cord, and P. P ´erez, “Zero-shot semantic segmentation,”Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019
[54]

Y. Xian, S. Choudhury, Y. He, B. Schiele, and Z. Akata, “Semantic projection network for zero- and few-label semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8256–8265
[55]

G. Pastore, F. Cermelli, Y. Xian, M. Mancini, Z. Akata, and B. Caputo, “A closer look at self-training for zero-label semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2693–2702
[56]

B. Li, K. Q. Weinberger, S. J. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022
[57]

J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 134–18 144
[58]

G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Scaling open-vocabulary image segmentation with image-level labels,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI. Springer, 2022, pp. 540–557
[59]

J. Xu, J. Hou, Y. Zhang, R. Feng, Y. Wang, Y. Qiao, and W. Xie, “Learning open-vocabulary semantic segmentation models from natural language supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2935–2944
[60]

P. Ren, C. Li, H. Xu, Y. Zhu, G. Wang, J. Liu, X. Chang, and X. Liang, “Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency,” arXiv preprint arXiv:2302.10307, 2023
[61]

F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7061–7070
[62]

Y. Xing, J. Kang, A. Xiao, J. Nie, L. Shao, and S. Lu, “Rewrite caption semantics: Bridging semantic gaps for language-supervised semantic segmentation,” Advances in Neural Information Processing Systems, vol. 36, 2024
[63]

M. Lan, C. Chen, Y. Ke, X. Wang, L. Feng, and W. Zhang, “Clearclip: Decomposing clip representations for dense vision-language inference,” arXiv preprint arXiv:2407.12442, 2024
[64]

F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” in European Conference on Computer Vision. Springer, 2025, pp. 315–332
[65]

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022
[66]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021
[67]

E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017
[68]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018
[69]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017
[70]

H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
