arxiv: 2605.19623 · v1 · pith:64RS6UZQnew · submitted 2026-05-19 · 💻 cs.CV

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

Pith reviewed 2026-05-20 06:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot adaptation · text-prompted segmentation · prototype learning · domain shift · semantic segmentation · instance segmentation · panoptic segmentation · vision-language models

The pith

PrAda learns visual prototypes from few examples to correct misclassifications in text-prompted segmentation models under domain shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that text-prompted segmentation models fail mainly due to misclassifying pixels when applied to specialized domains distant from their pre-training data, rather than failing to generate accurate masks. It introduces the task of few-shot visual adaptation for such models, which has been studied for classification but not segmentation. PrAda addresses this by learning class-specific prototypes that combine fine-grained pixel features with high-level transformer representations from a frozen base model. These prototypes are then fused with the original text-based predictions using a learned importance factor, allowing adaptation while preserving zero-shot performance on unseen classes.

Core claim

The central claim is that a parameter-efficient adaptation of frozen text-prompted segmentation models can be achieved by generating class-specific prototypes from combined pixel-level and transformer features, then blending those prototypes with text predictions via a learned weighting factor; this corrects domain-shift misclassifications across semantic, instance, and panoptic segmentation while retaining the model's ability to handle arbitrary text prompts without retraining.

What carries the argument

Prototype Adaptation (PrAda), which builds class-specific prototypes by merging fine-grained pixel features and high-level transformer representations then fuses them with text predictions through a learned importance factor to adjust classification under domain shift.
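
The fusion step described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the additive form `text_logit + alpha * cosine`, the function names, the treatment of classes without a prototype, and the toy vectors are all assumptions. The source only states that prototype similarities are fused with text predictions through a learned importance factor.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def fuse_scores(region_feat, prototypes, text_logits, alpha):
    """Assumed additive fusion: text logit plus alpha-weighted visual similarity.

    prototypes maps class name -> prototype vector. Classes without a
    prototype (zero-shot classes) keep their text logit unchanged; this
    handling is an assumption made for the sketch.
    """
    fused = {}
    for cls, t_logit in text_logits.items():
        proto = prototypes.get(cls)
        visual = cosine(region_feat, proto) if proto is not None else 0.0
        fused[cls] = t_logit + alpha * visual
    return fused

def predict(fused):
    """Return the class with the highest fused score."""
    return max(fused, key=fused.get)

# Toy example with hypothetical values.
region = [0.9, 0.1, 0.0]
prototypes = {"pizza": [1.0, 0.0, 0.0], "plate": [0.0, 1.0, 0.0]}
text_logits = {"pizza": 1.0, "plate": 2.0, "fork": 0.5}

print(predict(fuse_scores(region, prototypes, text_logits, alpha=0.0)))  # plate
print(predict(fuse_scores(region, prototypes, text_logits, alpha=5.0)))  # pizza
```

With `alpha=0.0` the text prediction is returned unchanged, which is one way to read the claim that zero-shot behavior is preserved; a larger `alpha` lets the visual prototype override a wrong text label.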

Load-bearing premise

That misclassification is the dominant error under domain shift and that fusing visual prototypes with text predictions will correct those errors without reducing performance on zero-shot classes or introducing new mistakes.
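
One way to probe this premise is to split errors by type. The sketch below is a hypothetical analysis, not the paper's protocol: it assumes each predicted mask has already been matched to a ground-truth mask, and it uses an IoU threshold of 0.5 to separate acceptable from poor localization. The record fields and helper names are illustrative.

```python
from collections import Counter

def mask_iou(pred, gt):
    """IoU between two binary masks given as sets of (row, col) pixels."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0

def error_breakdown(records, iou_thresh=0.5):
    """Count outcomes for matched (prediction, ground truth) pairs.

    Each record is a dict with keys: pred_mask, gt_mask, pred_label, gt_label.
    """
    counts = Counter()
    for r in records:
        well_localized = mask_iou(r["pred_mask"], r["gt_mask"]) > iou_thresh
        correct_label = r["pred_label"] == r["gt_label"]
        if well_localized and correct_label:
            counts["correct"] += 1
        elif well_localized:
            counts["misclassified"] += 1  # good mask, wrong label
        elif correct_label:
            counts["poor_mask"] += 1      # right label, bad mask
        else:
            counts["both_wrong"] += 1
    return counts

# Toy masks: a 4x4 square and the same square shifted by three columns.
square = {(r, c) for r in range(4) for c in range(4)}
shifted = {(r, c + 3) for r in range(4) for c in range(4)}
records = [
    {"pred_mask": square, "gt_mask": square, "pred_label": "car", "gt_label": "car"},
    {"pred_mask": square, "gt_mask": square, "pred_label": "bus", "gt_label": "car"},
    {"pred_mask": shifted, "gt_mask": square, "pred_label": "car", "gt_label": "car"},
]
print(dict(error_breakdown(records)))
```

If the premise holds on a given domain, the `misclassified` count should dominate `poor_mask` and `both_wrong`.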

What would settle it

Results on a held-out specialized domain would settle it: the claim fails if the adapted model shows no gain over the frozen baseline, or if zero-shot performance on new text prompts drops relative to the original model.

Figures

Figures reproduced from arXiv: 2605.19623 by Barbara Caputo, Carlo Masone, Fabio Cermelli, Gabriele Rosi.

Figure 1
Few-Shot Visual Adaptation for Text Prompted Segmentation (FSVA-Seg). Text-prompted segmentation models often localize object regions accurately, yet struggle to assign the correct category label. We introduce the FSVA-Seg setting, where just a few labeled examples are available to guide the adaptation process. Our approach, PrAda, learns class-specific visual prototypes from these examples to effective…
Figure 2
PrAda overview. An input image is passed through a frozen FC-CLIP [74] backbone to obtain visual features F and query representations Q. The segmentation masks from the decoder are used to pool features, which are then combined with their queries to form representations Φ̂. Cosine similarity with learnable class prototypes Φ, initialized from few-shot exemplars V, gives a visual similarity score S_visual. I…
Figure 3
Failure mode of text-prompted segmentation. (a) FC-CLIP [74] predictions across different datasets. Left: Example of failures, where the model accurately localizes the object mask, but wrongly classifies it. Right: Mask IoU distributions across three datasets show that predictions are heavily concentrated in high IoU ranges ([0.8, 1.0]), with the majority of masks achieving IoU > 0.5 with ground truth. Thi…
Figure 4
Prototypes initialization. Given an image, we first employ the frozen version of FC-CLIP [74] to predict a series of masks, each associated with a query q_i. Given the set of visual examples, we perform (1) an IoU-based matching to obtain the queries whose predicted masks best match the visual examples and (2) a mask pooling operation to obtain a condensed feature representation. T…
Figure 5
Influence of the number of adaptation images per class. We evaluate our approach on ADE20K and Cityscapes using 2, 5, and 10 adaptation images per class, and compare it to the zero-shot baseline without prototypes. 33.2 (10 images). These results highlight the strong sample efficiency of our method: even with a single adaptation sample per class, we obtain sizable gains over the baseline, and steady impr…
Figure 1
Final α values across datasets. We report the learned α values after training convergence for standard benchmarks (ADE20K, Cityscapes, Mapillary Vistas) and individual datasets in the ShowOrTell [58] benchmark. The red dashed line indicates the initialization value (α_init = 80).
Adjacent table fragment (columns: imgs/class, prompt type, then PQ / mIoU / AP for ADE20K and for Cityscapes): 1 image per class, V+T: ADE20K 25.1±1.1 / 30.2±1.0 / 16.3±0.9; Cityscapes 45.6±1.4 / 60.1±2.6 / 23.0±2.5. 2 V+T …
Figure 2
Complete oracle results on SegInW [86] datasets. We report zero-shot performance (mAP) of the original FC-CLIP [74] model across all 25 datasets in the SegInW benchmark, evaluated only on the classes present in the ground-truth masks. While some datasets show zero-shot performance close to the oracle upper bound, indicating strong pre-training alignment with those visual domains, other datasets exhibit sig…
Figure 3
Qualitative results on ADE20K [81]. For each image, the first row contains panoptic segmentation predictions, while the second row contains semantic segmentation predictions.
Figure 4
Qualitative results on Cityscapes [12]. For each image, the first row contains panoptic segmentation predictions, while the second row contains semantic segmentation predictions.
Figure 5
Qualitative results on Mapillary Vistas [54]. For each image, the first row contains panoptic segmentation predictions, while the second row contains semantic segmentation predictions.
Figure 6
Qualitative results on ShowOrTell [58] datasets. We select a subset of datasets composed of House-Parts, PIDray, Pizza and PASCAL VOC.
Abstract

Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.
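
A prototype of the kind the abstract describes could be built roughly as below. This is a sketch under stated assumptions: masked average pooling for the pixel features, concatenation to combine the pooled feature with the transformer query vector, and averaging over exemplars to initialize a class prototype are illustrative choices. None of these operations is confirmed by the abstract, and all function names are hypothetical.

```python
def mask_pool(features, mask):
    """Average the feature vectors of pixels where mask is 1.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for k in range(channels):
                    total[k] += vec[k]
    return [t / count for t in total] if count else total

def region_embedding(features, mask, query):
    """Assumed combination: concatenate pooled pixel feature and query vector."""
    return mask_pool(features, mask) + list(query)

def init_prototype(embeddings):
    """Assumed initialization: mean of the exemplar embeddings for one class."""
    n = len(embeddings)
    return [sum(vals) / n for vals in zip(*embeddings)]

# Toy 2x2 feature map with 2 channels (hypothetical values).
features = [[[1.0, 0.0], [3.0, 0.0]],
            [[0.0, 5.0], [0.0, 7.0]]]
top_row = [[1, 1], [0, 0]]
bottom_row = [[0, 0], [1, 1]]

e1 = region_embedding(features, top_row, query=[0.5])
e2 = region_embedding(features, bottom_row, query=[1.5])
print(e1)                        # [2.0, 0.0, 0.5]
print(e2)                        # [0.0, 6.0, 1.5]
print(init_prototype([e1, e2]))  # [1.0, 3.0, 1.0]
```

Only the prototype vectors would be trainable in such a setup, which is consistent with the parameter-efficient, frozen-backbone framing of the abstract.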

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the problem of few-shot visual adaptation for text-prompted segmentation and proposes PrAda, a parameter-efficient method that adapts a frozen foundational model. It claims that error analysis under domain shift reveals misclassification (rather than mask generation) as the dominant failure mode. PrAda learns class-specific prototypes by combining fine-grained pixel features with high-level transformer representations and fuses them with the original text-based predictions through a learned importance factor. This is intended to enable strong adaptation on new domains while preserving zero-shot capability. Experiments across semantic, instance, and panoptic segmentation on five benchmarks are reported to yield significant improvements over state-of-the-art and proposed baselines.

Significance. If the central claims are substantiated, the work would be significant for extending few-shot adaptation techniques from classification to text-prompted segmentation, an area noted as largely unexplored. The emphasis on parameter efficiency, retention of zero-shot performance, and evaluation across multiple segmentation paradigms and benchmarks would provide a practical contribution for deploying foundational models in specialized domains.

major comments (2)
  1. [Abstract] Abstract: the claim that 'misclassification rather than mask generation is the main culprit' under domain shift is presented as the key motivation, yet no quantitative error breakdown, failure-mode statistics, or before/after comparison is referenced to support this premise, which is load-bearing for the rationale of prototype fusion.
  2. [Method] The method description (implied in the abstract and method sections): the learned importance factor and class-specific prototypes (pixel features + transformer representations) are introduced as new components, but without an ablation isolating the fusion step or metrics showing specific reduction in misclassifications while preserving zero-shot mask quality on held-out classes/domains, the targeted correction mechanism remains unverified.
minor comments (1)
  1. [Abstract] The abstract mentions 'proposed baselines' without clarifying their construction or relation to existing few-shot segmentation methods; a dedicated section or table comparing against these would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments identify important opportunities to strengthen the substantiation of our claims, and we address each point below with plans for revision.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'misclassification rather than mask generation is the main culprit' under domain shift is presented as the key motivation, yet no quantitative error breakdown, failure-mode statistics, or before/after comparison is referenced to support this premise, which is load-bearing for the rationale of prototype fusion.

    Authors: We acknowledge that the submitted manuscript does not explicitly reference quantitative error statistics in the abstract or main text to support this claim. Our internal analysis did identify misclassification as the dominant failure mode under domain shift. In the revised manuscript we will add a dedicated subsection (or expanded paragraph in the introduction) presenting the quantitative error breakdown, including failure-mode statistics and before/after comparisons on held-out domains. This will directly support the motivation for prototype fusion. revision: yes

  2. Referee: [Method] The method description (implied in the abstract and method sections): the learned importance factor and class-specific prototypes (pixel features + transformer representations) are introduced as new components, but without an ablation isolating the fusion step or metrics showing specific reduction in misclassifications while preserving zero-shot mask quality on held-out classes/domains, the targeted correction mechanism remains unverified.

    Authors: We agree that an explicit ablation isolating the fusion step and the learned importance factor would strengthen verification of the correction mechanism. We have conducted such ablations, which show measurable reductions in misclassifications while preserving zero-shot mask quality on held-out classes and domains. We will add these ablation results, along with the corresponding metrics and analysis, to the revised method and experiments sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and claims rest on external empirical validation

full rationale

The paper proposes PrAda as a new few-shot adaptation technique that learns class-specific prototypes from pixel features and transformer representations, then fuses them with text predictions via a learned importance factor. This construction introduces independent learned components rather than defining outputs in terms of the frozen model's inputs. Central claims of improvement are supported by experiments across semantic/instance/panoptic segmentation on five benchmarks, not by any derivation that reduces to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to force the result. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on one key domain assumption about the dominant error type under shift and introduces a small number of learned components whose values are determined during adaptation.

free parameters (1)
  • learned importance factor
    A scalar or per-class weight learned to balance prototype-based predictions against the original text-based predictions.
axioms (1)
  • domain assumption: Misclassification rather than mask generation is the main culprit in performance degradation under domain shift for text-prompted segmentation models.
    Stated directly in the abstract as the result of studying existing methods' errors.
invented entities (1)
  • class-specific prototypes (no independent evidence)
    purpose: Compact representations that capture both fine-grained pixel features and high-level transformer features for each target class to enable adaptation.
    New construct introduced by the method; no independent evidence outside the adaptation process itself is described.
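
To make the free parameter in the ledger concrete, the toy below fits a scalar importance factor by gradient descent. Everything here is an assumption made for illustration: the cross-entropy loss, the additive fusion `text + alpha * visual`, the learning rate, the starting value of 0 (a figure caption on this page reports an initialization of 80), and the sample values. It is not the paper's training procedure.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_probs(text_logits, visual_sims, alpha):
    """Class probabilities under the assumed additive fusion."""
    return softmax([t + alpha * v for t, v in zip(text_logits, visual_sims)])

def mean_loss(samples, alpha):
    """Mean cross-entropy over (text logits, visual sims, true class) samples."""
    return sum(-math.log(fused_probs(t, v, alpha)[y]) for t, v, y in samples) / len(samples)

def mean_grad(samples, alpha):
    """d(mean loss)/d(alpha): mean over samples of sum_c (p_c - y_c) * visual_c."""
    total = 0.0
    for t, v, y in samples:
        probs = fused_probs(t, v, alpha)
        total += sum((p - (1.0 if c == y else 0.0)) * s
                     for c, (p, s) in enumerate(zip(probs, v)))
    return total / len(samples)

def fit_alpha(samples, alpha=0.0, lr=0.5, steps=200):
    """Plain gradient descent on a scalar alpha."""
    for _ in range(steps):
        alpha -= lr * mean_grad(samples, alpha)
    return alpha

# Toy samples: (text logits, visual similarities, true class index).
samples = [
    ([2.0, 1.0], [0.1, 0.9], 1),  # text prefers class 0, visual evidence favors class 1
    ([1.0, 2.0], [0.2, 0.8], 1),  # text and visual evidence agree on class 1
]
alpha = fit_alpha(samples)
print(alpha > 0)
print(mean_loss(samples, alpha) < mean_loss(samples, 0.0))
```

A per-class variant would replace the scalar `alpha` with one value per class; the ledger leaves open which form the paper uses.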

pith-pipeline@v0.9.0 · 5736 in / 1541 out tokens · 48154 ms · 2026-05-20T06:39:35.702366+00:00 · methodology

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 4 internal anchors

  1. [1]

    Zerowaste dataset: Towards deformable object segmentation in cluttered scenes

    Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. Zerowaste dataset: Towards deformable object segmentation in cluttered scenes. In CVPR, 2022.

  2. [2]

    What’s the point: Semantic segmentation with point supervision

    Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. In ECCV, 2016.

  3. [3]

    Language models are few-shot learners. NeurIPS, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.

  4. [4]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

  5. [5]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

  6. [6]

    PEM: Prototype-based efficient maskformer for image segmentation

    Niccolo Cavagnero, Gabriele Rosi, Claudia Cuttano, Francesca Pistilli, Marco Ciccone, Giuseppe Averta, and Fabio Cermelli. PEM: Prototype-based efficient maskformer for image segmentation. In CVPR, 2024.

  7. [7]

    Comformer: Continual learning in semantic and panoptic segmentation

    Fabio Cermelli, Matthieu Cord, and Arthur Douillard. Comformer: Continual learning in semantic and panoptic segmentation. In CVPR, 2023.

  8. [8]

    PLOT: Prompt learning with optimal transport for vision-language models

    Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. PLOT: Prompt learning with optimal transport for vision-language models. In ICLR,

  9. [9]

    Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021.

  10. [10]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR,

  11. [11]

    CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation. In CVPR, 2024.

  12. [12]

    The Cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR,

  14. [14]

    Cross-domain transfer learning with corte: Consistent and reliable transfer from black-box to lightweight segmentation model

    Claudia Cuttano, Antonio Tavera, Fabio Cermelli, Giuseppe Averta, and Barbara Caputo. Cross-domain transfer learning with corte: Consistent and reliable transfer from black-box to lightweight segmentation model. In ICCV, 2023.

  15. [15]

    SAMWISE: Infusing wisdom in sam2 for text-driven video segmentation

    Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. SAMWISE: Infusing wisdom in sam2 for text-driven video segmentation. In CVPR,

  16. [16]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019.

  17. [17]

    Decoupling zero-shot semantic segmentation

    Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In CVPR, 2022.

  18. [18]

    Open-vocabulary universal image segmentation with MaskCLIP

    Zheng Ding, Jieke Wang, and Zhuowen Tu. Open-vocabulary universal image segmentation with MaskCLIP. Preprint arXiv:2208.08984, 2022.

  19. [19]

    A new large-scale food image segmentation dataset and its application to food calorie estimation based on grains of rice

    Takumi Ege, Wataru Shimoda, and Keiji Yanai. A new large-scale food image segmentation dataset and its application to food calorie estimation based on grains of rice. In Proceedings of the 5th international workshop on multimedia assisted dietary management, 2019.

  20. [20]

    The pascal visual object classes (voc) challenge. IJCV, 88:303–338, 2010

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88:303–338, 2010.

  21. [21]

    The Pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015

    Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.

  22. [22]

    Rethinking few-shot adaptation of vision-language models in two stages

    Matteo Farina, Massimiliano Mancini, Giovanni Iacca, and Elisa Ricci. Rethinking few-shot adaptation of vision-language models in two stages. In CVPR, 2025.

  23. [23]

    CLIP-Adapter: Better vision-language models with feature adapters. IJCV, 132(2):581–595, 2024

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. IJCV, 132(2):581–595, 2024.

  24. [24]

    Scaling open-vocabulary image segmentation with image-level labels

    Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In ECCV, 2022.

  25. [25]

    Text-guided visual prompt dino for generic segmentation

    Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, and Chen Li. Text-guided visual prompt dino for generic segmentation. In ICCV, 2025.

  26. [26]

    kNN-CLIP: Retrieval enables training-free segmentation on continually expanding large vocabularies. TMLR, 2024

    Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, and Philip Torr. kNN-CLIP: Retrieval enables training-free segmentation on continually expanding large vocabularies. TMLR, 2024.

  27. [27]

    Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation

    Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In WACV, 2025.

  28. [28]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,

  29. [29]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, 2019.

  30. [30]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

  31. [31]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.

  32. [32]

    T-Rex2: Towards generic object detection via text-visual prompt synergy

    Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-Rex2: Towards generic object detection via text-visual prompt synergy. In ECCV, 2024.

  33. [33]

    Collaborative vision-text representation optimizing for open-vocabulary segmentation

    Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, and Humphrey Shi. Collaborative vision-text representation optimizing for open-vocabulary segmentation. In ECCV, 2024.

  34. [34]

    DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment

    Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. In CVPR, 2025.

  35. [35]

    Your ViT is secretly an image segmentation model

    Tommie Kerssies, Niccolo Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. Your ViT is secretly an image segmentation model. In CVPR, 2025.

  36. [36]

    MaPLe: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In CVPR, 2023.

  37. [37]

    Panoptic segmentation

    Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR,

  38. [38]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.

  39. [39]

    ProxyCLIP: Proxy attention improves clip for open-vocabulary segmentation

    Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. ProxyCLIP: Proxy attention improves clip for open-vocabulary segmentation. In ECCV, 2024.

  40. [40]

    Language-driven semantic segmentation

    Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. In ICLR, 2022.

  41. [41]

    Mask DINO: Towards a unified transformer-based framework for object detection and segmentation

    Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask DINO: Towards a unified transformer-based framework for object detection and segmentation. In CVPR, 2023.

  42. [42]

    Visual in-context prompting

    Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Jianwei Yang, Chunyuan Li, et al. Visual in-context prompting. In CVPR,

  43. [43]

    Multiple-Human Parsing in the Wild

    Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, Terence Sim, Shuicheng Yan, and Jiashi Feng. Multiple-human parsing in the wild. arXiv preprint arXiv:1705.07206,

  44. [44]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. Preprint arXiv:2101.00190, 2021.

  45. [45]

    Learning without forgetting

    Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.

  46. [46]

    Open-vocabulary semantic segmentation with mask-adapted CLIP

    Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In CVPR, 2023.

  47. [47]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

  48. [48]

    GRES: Generalized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In CVPR, 2023.

  49. [49]

    A simple image segmentation framework via in-context examples. NeurIPS,

    Yang Liu, Chenchen Jing, Hengtao Li, Muzhi Zhu, Hao Chen, Xinlong Wang, and Chunhua Shen. A simple image segmentation framework via in-context examples. NeurIPS,

  50. [50]

    Matcher: Segment anything with one shot using all-purpose feature matching

    Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, and Chunhua Shen. Matcher: Segment anything with one shot using all-purpose feature matching. In ICLR, 2024.

  51. [51]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022.

  52. [52]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

  53. [53]

    Uavid: A semantic segmentation dataset for uav imagery. ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020

    Ye Lyu, George Vosselman, Gui-Song Xia, Alper Yilmaz, and Michael Ying Yang. Uavid: A semantic segmentation dataset for uav imagery. ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020.

  54. [54]

    Simple open-vocabulary object detection

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In ECCV, 2022.

  55. [55]

    The Mapillary Vistas dataset for semantic understanding of street scenes

    Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, 2017.

  56. [56]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Preprint arXiv:2304.07193, 2023.

  57. [57]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  58. [58]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. In ICLR, 2025.

  59. [59]

    Show or tell? a benchmark to evaluate visual and textual prompts in semantic segmentation

    Gabriele Rosi and Fabio Cermelli. Show or tell? a benchmark to evaluate visual and textual prompts in semantic segmentation. In CVPR, 2025.

  60. [60]

    The revenge of bisenet: Efficient multi-task image segmentation

    Gabriele Rosi, Claudia Cuttano, Niccol `o Cavagnero, Giuseppe Averta, and Fabio Cermelli. The revenge of bisenet: Efficient multi-task image segmentation. InCVPR,

  61. [61]

    Aligning and prompting everything all at once for univer- sal visual perception

    Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, and Rongrong Ji. Aligning and prompting everything all at once for univer- sal visual perception. InCVPR, 2024. 7

  62. [62]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. Preprint arXiv:2508.10104, 2025. 2

  63. [63]

    Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation

    Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation. In NeurIPS,

  64. [64]

    Images speak in images: A generalist painter for in-context visual learning

    Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023. 1

  65. [65]

    SegGPT: Segmenting everything in context

    Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. SegGPT: Segmenting everything in context. In ICCV, 2023. 1

  66. [66]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In CVPR, 2022. 3

  67. [67]

    CLIP-DIY: Clip dense inference yields open-vocabulary semantic segmentation for-free

    Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, and Oriane Siméoni. CLIP-DIY: Clip dense inference yields open-vocabulary semantic segmentation for-free. In WACV, 2024. 2

  68. [68]

    CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

    Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Pérez. CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. In ECCV, 2024. 2

  69. [69]

    A survey of efficient fine-tuning methods for vision-language models—prompt and adapter

    Jialu Xing, Jianping Liu, Jian Wang, Lulu Sun, Xi Chen, Xunxun Gu, and Yingfei Wang. A survey of efficient fine-tuning methods for vision-language models—prompt and adapter. Computers & Graphics, 119:103885, 2024. 3

  70. [70]

    Open-vocabulary panoptic segmentation with text-to-image diffusion models

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, 2023. 1, 2, 5, 7

  71. [71]

    A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model

    Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In ECCV, 2022. 2

  72. [72]

    Side adapter network for open-vocabulary semantic segmentation

    Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In CVPR, 2023. 2, 3

  73. [73]

    Multi-modal queried object detection in the wild

    Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, and Changsheng Xu. Multi-modal queried object detection in the wild. NeurIPS, 2023. 2

  74. [74]

    MMA: Multi-modal adapter for vision-language models

    Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, and Xiaohua Xie. MMA: Multi-modal adapter for vision-language models. In CVPR, 2024. 3, 5

  75. [75]

    Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. NeurIPS,

  76. [76]

    1, 2, 3, 4, 5, 6, 7, 14, 15

  77. [77]

    Task residual for tuning vision-language models

    Tao Yu, Zhihe Lu, Xin Jin, Zhibo Chen, and Xinchao Wang. Task residual for tuning vision-language models. In CVPR,

  78. [78]

    Open-vocabulary detr with conditional matching

    Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In ECCV, 2022. 2

  79. [79]

    Bridge the points: Graph-based few-shot segment anything semantically

    Anqi Zhang, Guangyu Gao, Jianbo Jiao, Chi Liu, and Yunchao Wei. Bridge the points: Graph-based few-shot segment anything semantically. NeurIPS, 2024. 7

  80. [80]

    A simple framework for open-vocabulary segmentation and detection

    Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection. In ICCV,
