arxiv: 2408.06747 · v4 · submitted 2024-08-13 · 💻 cs.CV

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Pith reviewed 2026-05-23 21:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords unsupervised semantic segmentation · CLIP · bias rectification · reference prompt · positional embedding · contrastive learning · mask decoder · Gumbel-Softmax

The pith

CLIP's class-preference and space-preference biases can be explicitly modeled with a reference prompt and positional embeddings, then rectified by subtracting a generated bias logit map to enable better unsupervised semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that biases in CLIP limit its use for unsupervised semantic segmentation and shows how to model and correct them. Class-preference bias is captured by a learnable Reference prompt, while space-preference bias comes from projecting the vision transformer's positional embedding. These are kept separate as features, multiplied to form a bias logit map, and subtracted from CLIP's output logits. A mask decoder then produces refined masks using Gumbel-Softmax, guided by a contrastive loss on masked features and class text features. This leads to stronger results across standard segmentation benchmarks without any annotations.
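The Gumbel-Softmax step mentioned above can be illustrated with a minimal standard-library sketch. The function name, the `tau` temperature and the `hard` flag are illustrative assumptions, not taken from the paper or its code. The sketch only shows the sampling itself; a trainable version would normally live in an autodiff framework.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, hard=False, rng=random):
    """Sample a relaxed one-hot vector over classes from per-class logits.

    Hypothetical helper for illustration; `tau` is the temperature.
    """
    noisy = []
    for logit in logits:
        # Clamp u strictly inside (0, 1) so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)

    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft

    # Hard variant: one-hot at the arg-max of the soft sample.
    winner = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]
```

Lower `tau` pushes the soft sample toward one-hot. The hard variant here simply discretizes the sample; in a differentiable setting it is usually paired with a straight-through gradient estimator, which this sketch does not model.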

Core claim

The paper claims that by independently encoding class-preference bias into a Reference feature and space-preference bias into a positional feature, generating a bias logit map via their matrix multiplication, subtracting this map from CLIP logits, and processing the result through a mask decoder with contrastive supervision, the bias in CLIP is rectified to produce accurate unsupervised semantic segmentation.
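One way to write the claimed pipeline down, as a hedged sketch: the symbols, shapes and operand order below are assumptions introduced for illustration, since the abstract states only that the two features are combined by matrix multiplication and the result is subtracted element-wise.

```latex
% Assumed notation (not from the paper): N patch positions, K classes, D feature channels.
% P: positional feature, R: Reference feature, L: CLIP logits, F: CLIP visual feature,
% Dec: mask decoder, M: output mask.
\begin{align}
  B &= P R^{\top}, & P &\in \mathbb{R}^{N \times D},\; R \in \mathbb{R}^{K \times D} \\
  \tilde{L} &= L - B, & L, B, \tilde{L} &\in \mathbb{R}^{N \times K} \\
  M &= \operatorname{GumbelSoftmax}\bigl(\operatorname{Dec}(F, \tilde{L})\bigr)
\end{align}
```

Under this reading, each entry of the bias logit map is the inner product of one position's feature with one class's Reference feature, so every (position, class) pair receives its own correction.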

What carries the argument

The bias logit map, created by matrix multiplication of the Reference feature and positional feature, which is subtracted element-wise from CLIP logits to rectify biases before mask decoding.
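A toy numerical sketch of that rectification step follows. The shapes (3 positions, 2 classes, 2 channels), the helper names and all numbers are made up for illustration and do not come from the paper or its repository.

```python
def matmul_transposed(a, b):
    """Return a @ b^T for nested lists: a is (N, D), b is (K, D), result is (N, K)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def rectify_logits(clip_logits, positional_feature, reference_feature):
    """Build the bias logit map and subtract it element-wise from the CLIP logits."""
    bias_map = matmul_transposed(positional_feature, reference_feature)
    rectified = [
        [logit - bias for logit, bias in zip(logit_row, bias_row)]
        for logit_row, bias_row in zip(clip_logits, bias_map)
    ]
    return bias_map, rectified


# Toy inputs (assumed values): N=3 positions, K=2 classes, D=2 channels.
positional_feature = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (N, D)
reference_feature = [[0.5, 0.0], [0.0, 0.25]]              # (K, D)
clip_logits = [[2.0, 1.0], [1.0, 1.0], [0.5, 3.0]]         # (N, K)

bias_map, rectified = rectify_logits(clip_logits, positional_feature, reference_feature)
print(bias_map)   # [[0.5, 0.0], [0.0, 0.25], [0.5, 0.25]]
print(rectified)  # [[1.5, 1.0], [1.0, 0.75], [0.0, 2.75]]
```

In the method both features are learned. Here they are fixed constants so the arithmetic can be checked by hand.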

If this is right

  • The method achieves favorable performance against prior state-of-the-art on PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff.
  • The contrastive loss ensures the bias rectification process is meaningful by aligning visual and text features.
  • The mask decoder smooths the rectified logits into contextual segmentation masks.
  • Separate encoding of biases avoids interference between class and space preferences.
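The contrastive supervision in the second bullet can be sketched as follows. The abstract does not give the loss formula, so this uses a common InfoNCE-style form as a stand-in: mask-weighted pooling of visual features per class, cosine similarity against every class text feature, and cross-entropy toward the matching class. All names and the `temperature` value are assumptions.

```python
import math


def masked_pool(features, masks):
    """features: (N, D) visual features; masks: (N, K) soft masks.

    Returns (K, D): one mask-weighted mean feature per class.
    """
    n_classes, dim = len(masks[0]), len(features[0])
    pooled = []
    for k in range(n_classes):
        weight = sum(m[k] for m in masks) + 1e-8  # avoid division by zero
        pooled.append(
            [sum(m[k] * f[d] for m, f in zip(masks, features)) / weight for d in range(dim)]
        )
    return pooled


def cosine(a, b):
    """Cosine similarity with a small epsilon for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) + 1e-8
    norm_b = math.sqrt(sum(y * y for y in b)) + 1e-8
    return dot / (norm_a * norm_b)


def contrastive_loss(features, masks, text_features, temperature=0.07):
    """Mean cross-entropy of each class's pooled feature against all text features."""
    pooled = masked_pool(features, masks)
    loss = 0.0
    for k, visual in enumerate(pooled):
        sims = [cosine(visual, text) / temperature for text in text_features]
        peak = max(sims)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
        loss += log_denominator - sims[k]  # -log softmax at the matching class
    return loss / len(pooled)
```

With this form, a mask that gathers the pixels of class k pulls the pooled feature toward the text feature of class k, so the loss can only fall if the rectified logits feeding the mask decoder produce sensible masks. Whether the paper's loss has exactly this shape is not stated in the abstract.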

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar bias modeling might apply to other dense prediction tasks using CLIP-like models.
  • Exploring different ways to combine the bias features could yield further improvements.
  • The approach suggests that explicit bias correction could enhance CLIP in low-supervision scenarios beyond segmentation.

Load-bearing premise

The assumption that independently encoding the two biases into a Reference feature and a positional feature, then combining them via matrix multiplication, produces a bias logit map whose subtraction from CLIP logits yields a meaningful rectification.

What would settle it

If experiments show that subtracting the bias logit map fails to improve, or even degrades, segmentation accuracy on the listed benchmarks relative to raw CLIP logits, the rectification approach would be falsified.
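A minimal sketch of such a comparison, assuming mean intersection-over-union (mIoU), the usual metric on these benchmarks, and arg-max labelling of the logits. The helper names and the four-pixel toy data are invented for illustration and are not results from the paper.

```python
def argmax_labels(logits):
    """Pick the highest-scoring class index for every position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy data (assumed): 4 positions, 2 classes; ground truth is used only for scoring.
target = [0, 0, 1, 1]
raw_logits = [[2.0, 1.0], [2.0, 1.0], [2.0, 1.5], [1.0, 3.0]]
bias_map = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
rectified = [[l - b for l, b in zip(lr, br)] for lr, br in zip(raw_logits, bias_map)]

raw_miou = mean_iou(argmax_labels(raw_logits), target, num_classes=2)
rect_miou = mean_iou(argmax_labels(rectified), target, num_classes=2)
print(raw_miou, rect_miou)
```

On real data the same comparison would be run per benchmark, with the rectified score falling at or below the raw score counting against the claim.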

Original abstract

Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don't explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into different features, i.e., the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at: https://github.com/dogehhh/ReCLIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that CLIP exhibits class-preference and space-preference biases when applied to unsupervised semantic segmentation, and proposes ReCLIP++ to explicitly model and rectify them: a learnable Reference prompt encodes class bias into a Reference feature, a projection of ViT positional embeddings encodes space bias into a positional feature, their matrix product yields a bias logit map, which is subtracted element-wise from CLIP logits; a mask decoder with Gumbel-Softmax then produces the final mask, trained with a contrastive loss on masked visual and text features. Experiments on PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff report favorable results against prior SOTA, with code released.

Significance. If the bias logit map construction can be shown to isolate the claimed biases rather than serving as an arbitrary learned correction, the explicit rectification approach would strengthen the use of CLIP for dense prediction tasks and provide a reusable template for bias analysis in vision-language models. The multi-benchmark evaluation and open implementation are positive contributions.

major comments (2)
  1. [abstract / method description] Abstract and method description: the central construction (independent encoding of Reference feature and positional feature, followed by matrix multiplication to produce the bias logit map and element-wise subtraction from CLIP logits) does not enforce that the resulting map isolates class-preference and space-preference biases; nothing in the formulation prevents it from learning an arbitrary adjustment, so performance gains cannot be attributed specifically to bias rectification without additional evidence such as correlation analysis between the subtracted map and observed CLIP biases on uniform regions or class-imbalanced patches.
  2. [abstract] Abstract: the contrastive loss is stated to make 'the bias modeling and rectification process meaningful and effective,' yet no derivation or ablation demonstrates that the loss directly supervises the bias components rather than the downstream mask decoder; this leaves open whether the rectification step is load-bearing or incidental to the added capacity.
minor comments (2)
  1. [experiments] The paper should include visualizations of the generated bias logit maps and their alignment with expected class/space biases to support the modeling claim.
  2. [method] Notation for the Reference prompt parameters and the positional projection should be formalized with equations to clarify independence of the two bias encodings.
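The correlation analysis requested in major comment 1 could be approached along these lines. This is a sketch under assumptions: CLIP's empirical bias is estimated as the mean logit per (position, class) over a set of probe images such as uniform regions, and it is compared with the learned bias logit map by Pearson correlation. The function names are hypothetical.

```python
import math


def mean_logit_map(logit_maps):
    """Average a list of (N, K) logit maps into one (N, K) empirical bias estimate."""
    n_images = len(logit_maps)
    n_pos, n_cls = len(logit_maps[0]), len(logit_maps[0][0])
    return [
        [sum(m[i][k] for m in logit_maps) / n_images for k in range(n_cls)]
        for i in range(n_pos)
    ]


def flatten(matrix):
    """Flatten an (N, K) nested list row by row."""
    return [value for row in matrix for value in row]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def bias_agreement(probe_logit_maps, learned_bias_map):
    """Correlation between the empirical CLIP bias and the learned bias logit map."""
    empirical = mean_logit_map(probe_logit_maps)
    return pearson(flatten(empirical), flatten(learned_bias_map))
```

A correlation near 1 would support reading the subtracted map as a bias estimate, while a value near 0 would be consistent with the referee's worry that the map is an arbitrary learned adjustment.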

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline planned revisions to strengthen the presentation of the bias rectification approach.

Point-by-point responses
  1. Referee: [abstract / method description] Abstract and method description: the central construction (independent encoding of Reference feature and positional feature, followed by matrix multiplication to produce the bias logit map and element-wise subtraction from CLIP logits) does not enforce that the resulting map isolates class-preference and space-preference biases; nothing in the formulation prevents it from learning an arbitrary adjustment, so performance gains cannot be attributed specifically to bias rectification without additional evidence such as correlation analysis between the subtracted map and observed CLIP biases on uniform regions or class-imbalanced patches.

    Authors: The architecture explicitly separates the inputs to the two features before their matrix product forms the bias logit map that is subtracted from CLIP logits: the learnable Reference prompt is optimized to capture class-preference bias, while the projection of ViT positional embeddings targets space-preference bias. This separation is intended to model the two biases independently rather than as a generic correction. We acknowledge that the formulation alone does not mathematically guarantee that the map isolates the two biases rather than learning an arbitrary adjustment. To provide the requested evidence, we will add correlation analysis between the subtracted bias maps and CLIP behavior on uniform regions and class-imbalanced patches in the revised manuscript. revision: yes

  2. Referee: [abstract] Abstract: the contrastive loss is stated to make 'the bias modeling and rectification process meaningful and effective,' yet no derivation or ablation demonstrates that the loss directly supervises the bias components rather than the downstream mask decoder; this leaves open whether the rectification step is load-bearing or incidental to the added capacity.

    Authors: The contrastive loss is computed on masked visual features (produced after rectification and Gumbel-Softmax decoding) against class text features, thereby providing end-to-end supervision that includes the rectification step. The current manuscript does not contain a formal derivation or an ablation that isolates supervision specifically on the bias components versus the mask decoder. We will add such an ablation study in the revision to clarify the contribution of the rectification step. revision: yes

Circularity Check

0 steps flagged

No circularity: new architecture and loss evaluated externally

Full rationale

The paper introduces a learnable Reference prompt, positional projection, matrix-multiplication bias map, element-wise subtraction, mask decoder with Gumbel-Softmax, and contrastive loss on masked features. These are trained end-to-end and evaluated on held-out benchmarks (PASCAL VOC, ADE20K, etc.). No equation reduces a reported quantity to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise for the rectification step, and the contrastive objective is a standard training signal rather than a tautological renaming of inputs. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that CLIP exhibits separable class and space biases that can be captured by the proposed reference and positional features; the method introduces one learnable component (the reference prompt) and one derived representation (the bias logit map) whose independent evidence is limited to the training objective itself.

free parameters (1)
  • learnable Reference prompt parameters
    The reference prompt is optimized during training to encode class-preference bias; its weights are not fixed a priori.
axioms (1)
  • domain assumption: CLIP exhibits class-preference bias and space-preference bias that can be modeled independently without interference
    Invoked when the authors state that the two biases are encoded separately to avoid interference before matrix multiplication.
invented entities (1)
  • bias logit map (no independent evidence)
    purpose: Explicit representation of combined class and space biases obtained via matrix multiplication of reference and positional features
    New derived quantity introduced to enable the rectification step; no external falsifiable prediction is supplied.

pith-pipeline@v0.9.0 · 5843 in / 1593 out tokens · 51351 ms · 2026-05-23T21:39:52.637156+00:00 · methodology
