PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

Barbara Caputo; Carlo Masone; Fabio Cermelli; Gabriele Rosi

arxiv: 2605.19623 · v1 · pith:64RS6UZQnew · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-20 06:39 UTC · model grok-4.3

keywords: few-shot adaptation, text-prompted segmentation, prototype learning, domain shift, semantic segmentation, instance segmentation, panoptic segmentation, vision-language models

The pith

PrAda learns visual prototypes from few examples to correct misclassifications in text-prompted segmentation models under domain shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that text-prompted segmentation models fail mainly due to misclassifying pixels when applied to specialized domains distant from their pre-training data, rather than failing to generate accurate masks. It introduces the task of few-shot visual adaptation for such models, which has been studied for classification but not segmentation. PrAda addresses this by learning class-specific prototypes that combine fine-grained pixel features with high-level transformer representations from a frozen base model. These prototypes are then fused with the original text-based predictions using a learned importance factor, allowing adaptation while preserving zero-shot performance on unseen classes.

Core claim

The central claim is that a parameter-efficient adaptation of frozen text-prompted segmentation models can be achieved by generating class-specific prototypes from combined pixel-level and transformer features, then blending those prototypes with text predictions via a learned weighting factor; this corrects domain-shift misclassifications across semantic, instance, and panoptic segmentation while retaining the model's ability to handle arbitrary text prompts without retraining.

What carries the argument

Prototype Adaptation (PrAda), which builds class-specific prototypes by merging fine-grained pixel features and high-level transformer representations then fuses them with text predictions through a learned importance factor to adjust classification under domain shift.
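
The fusion step can be illustrated with a minimal sketch. The summary does not give the exact fusion rule, so the additive form below is an assumption: the frozen model's text logits are shifted by the prototype similarity scaled by an importance factor. The names `fuse_scores` and `has_prototype` are hypothetical, and the factor is a fixed number here, whereas the paper learns it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scores(text_logits, visual_sim, alpha, has_prototype):
    """Blend text logits with prototype similarities (assumed additive rule).

    text_logits:   (num_masks, num_classes) scores from the frozen text branch.
    visual_sim:    (num_masks, num_classes) cosine similarities to prototypes.
    alpha:         scalar importance factor (learned in the paper, fixed here).
    has_prototype: (num_classes,) bool, True where few-shot examples exist.
    """
    # Classes without a prototype get no visual correction, so their
    # text-only (zero-shot) logit is left as it was.
    visual_term = np.where(has_prototype, alpha * visual_sim, 0.0)
    return softmax(text_logits + visual_term, axis=-1)

text_logits = np.array([[2.0, 1.0, 0.5]])   # text branch prefers class 0
visual_sim = np.array([[0.1, 0.9, 0.8]])    # prototypes prefer class 1
has_prototype = np.array([True, True, False])

probs = fuse_scores(text_logits, visual_sim, alpha=5.0, has_prototype=has_prototype)
print(probs.argmax(axis=-1))
```

With `alpha=0.0` the function reduces to the text-only prediction, which is one simple way to check that a fusion rule of this kind can fall back to zero-shot behavior.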

Load-bearing premise

That misclassification is the dominant error under domain shift and that fusing visual prototypes with text predictions will correct those errors without reducing performance on zero-shot classes or introducing new mistakes.
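
One way to test this premise is to split errors into localization failures and label failures. The sketch below is not the paper's analysis protocol; it assumes binary masks, matches each ground-truth mask to its best-overlapping prediction, and uses an IoU threshold of 0.5. The function names and the three bucket names are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    # Intersection over union of two binary masks.
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def error_breakdown(pred_masks, pred_labels, gt_masks, gt_labels, iou_thr=0.5):
    """Count, per ground-truth mask, whether the best-matching prediction is
    correct, well localized but mislabeled, or poorly localized."""
    counts = {"correct": 0, "misclassified": 0, "poorly_localized": 0}
    for gt_mask, gt_label in zip(gt_masks, gt_labels):
        ious = [mask_iou(p, gt_mask) for p in pred_masks]
        if not ious or max(ious) <= iou_thr:
            counts["poorly_localized"] += 1
            continue
        best = int(np.argmax(ious))
        if pred_labels[best] == gt_label:
            counts["correct"] += 1
        else:
            counts["misclassified"] += 1
    return counts

left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
right = ~left
corner = np.zeros((4, 4), dtype=bool)
corner[0, 0] = True

result = error_breakdown(
    pred_masks=[left, right], pred_labels=[3, 2],
    gt_masks=[left, right, corner], gt_labels=[1, 2, 5],
)
print(result)
```

If the premise holds on a target domain, the `misclassified` bucket should dominate `poorly_localized` for the frozen model.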

What would settle it

A held-out specialized domain on which the adapted model shows no gain over the frozen baseline, or on which zero-shot performance on new text prompts drops relative to the original model, would count against the claim.

Figures

Figures reproduced from arXiv: 2605.19623 by Barbara Caputo, Carlo Masone, Fabio Cermelli, Gabriele Rosi.

**Figure 1.** Few-Shot Visual Adaptation for Text Prompted Segmentation (FSVA-Seg). Text-prompted segmentation models often localize object regions accurately, yet struggle to assign the correct category label. We introduce the FSVA-Seg setting, where just a few labeled examples are available to guide the adaptation process. Our approach, PrAda, learns class-specific visual prototypes from these examples to effective…

**Figure 2.** PrAda overview. An input image is passed through a frozen FC-CLIP [74] backbone to obtain visual features F and query representations Q. The segmentation masks from the decoder are used to pool features, which are then combined with their queries to form representations Φˆ. Cosine similarity with learnable class prototypes Φ, initialized from few-shot exemplars V, gives a visual similarity score Svisual. I…

**Figure 3.** Failure mode of text-prompted segmentation. (a) FC-CLIP [74] predictions across different datasets. Left: Example of failures, where the model accurately localizes the object mask, but wrongly classifies it. Right: Mask IoU distributions across three datasets show that predictions are heavily concentrated in high IoU ranges ([0.8, 1.0]), with the majority of masks achieving IoU > 0.5 with ground truth. Thi…

**Figure 4.** Prototype initialization. Given an image, we first employ the frozen version of FC-CLIP [74] to predict a series of masks, each associated with a query qi. Given the set of visual examples, we perform (1) an IoU-based matching to obtain the queries of the predicted masks that best match the visual examples and (2) a mask pooling operation to obtain a condensed feature representation. T…

**Figure 5.** Influence of the number of adaptation images per class. We evaluate our approach on ADE20K and Cityscapes using 2, 5, and 10 adaptation images per class, and compare it to the zero-shot baseline without prototypes.

… 33.2 (10 images). These results highlight the strong sample efficiency of our method: even with a single adaptation sample per class, we obtain sizable gains over the baseline, and steady impr…

**Figure 1.** Final α values across datasets. We report the learned α values after training convergence for standard benchmarks (ADE20K, Cityscapes, Mapillary Vistas) and individual datasets in the ShowOrTell [58] benchmark. The red dashed line indicates the initialization value (αinit = 80).

| imgs/class | Prompt Type | ADE20K PQ | ADE20K mIoU | ADE20K AP | Cityscapes PQ | Cityscapes mIoU | Cityscapes AP |
|---|---|---|---|---|---|---|---|
| 1 | V+T | 25.1±1.1 | 30.2±1.0 | 16.3±0.9 | 45.6±1.4 | 60.1±2.6 | 23.0±2.5 |
| 2 | V+T | … | | | | | |

**Figure 2.** Complete oracle results on SegInW [86] datasets. We report zero-shot performance (mAP) of the original FC-CLIP [74] model across all 25 datasets in the SegInW benchmark, evaluated only on the classes present in the ground-truth masks. While some datasets show zero-shot performance close to the oracle upper bound, indicating strong pre-training alignment with those visual domains, other datasets exhibit sig…

**Figure 3.** Qualitative results on ADE20K [81]. For each image, the first row contains panoptic segmentation predictions, while the second row contains semantic segmentation predictions.

**Figure 4.** Qualitative results on Cityscapes [12]. For each image, the first row contains panoptic segmentation predictions, while the second row contains semantic segmentation predictions.

**Figure 5.** Qualitative results on Mapillary Vistas [54]. For each image, the first row contains panoptic segmentation predictions, while the second row contains semantic segmentation predictions.

**Figure 6.** Qualitative results on ShowOrTell [58] datasets. We select a subset of datasets composed of House-Parts, PIDray, Pizza, and PASCAL VOC.
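
The prototype-initialization caption above (Figure 4 in the first group) mentions an IoU-based matching between predicted masks and the few-shot examples. A minimal sketch of that matching follows. It assumes binary masks, picks the single best-overlapping predicted mask per example, and averages the matched queries per class; the averaging step and the names `iou_matrix` and `init_query_prototypes` are illustrative assumptions, not details confirmed by the caption.

```python
import numpy as np

def iou_matrix(pred_masks, ex_masks):
    """Pairwise IoU between example masks (E, H, W) and predicted masks (P, H, W)."""
    p = pred_masks.reshape(len(pred_masks), -1).astype(float)
    e = ex_masks.reshape(len(ex_masks), -1).astype(float)
    inter = e @ p.T                                      # (E, P) overlap areas
    union = e.sum(axis=1)[:, None] + p.sum(axis=1)[None, :] - inter
    return inter / np.maximum(union, 1e-6)

def init_query_prototypes(pred_masks, queries, ex_masks, ex_labels, num_classes):
    # For each example mask, take the query of the best-overlapping prediction.
    best = iou_matrix(pred_masks, ex_masks).argmax(axis=1)   # (E,)
    matched = queries[best]                                  # (E, D)
    protos = np.zeros((num_classes, queries.shape[1]))
    for k in range(num_classes):
        sel = ex_labels == k
        if sel.any():
            # Assumed aggregation: mean of matched queries for class k.
            protos[k] = matched[sel].mean(axis=0)
    return protos, best

pred_masks = np.array([
    [[1, 1], [0, 0]],
    [[0, 0], [1, 1]],
    [[1, 0], [0, 0]],
])
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ex_masks = np.array([[[1, 1], [0, 0]], [[0, 0], [1, 1]]])
ex_labels = np.array([0, 1])

protos, best = init_query_prototypes(pred_masks, queries, ex_masks, ex_labels, num_classes=3)
print(best, protos)
```

Classes with no examples keep an all-zero row here; a real system would need to mask them out so they fall back to the text-only prediction.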

Original abstract

Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PrAda frames few-shot adaptation for text-prompted segmentation as a distinct problem and uses learned prototypes to target misclassifications under domain shift. The authors look at where current text-prompted models break on specialized domains and conclude that misclassification drives most of the errors rather than problems with the actual mask shapes. They define the new task around this observation and introduce PrAda, which builds class-specific prototypes from both fine-grained pixel features and higher-level transformer representations, then fuses them with the original text predictions through a learned importance factor. The method stays parameter-efficient and is designed to leave zero-shot behavior intact for classes outside the few-shot examples.

Experiments run across semantic, instance, and panoptic segmentation on five benchmarks and show gains over prior methods plus some new baselines they add. That coverage across task types is a practical strength.

The softer spots sit in the supporting analysis. The abstract states the misclassification finding but gives no quantitative error breakdown or before-and-after numbers to confirm it is the dominant failure mode. There are also no reported ablations that isolate the fusion step or metrics that check zero-shot retention on held-out classes and domains. Without those, it is harder to rule out that the gains come from generic regularization rather than the targeted fix.

This work is aimed at researchers adapting foundational segmentation models to new domains when pixel labels are scarce. Readers interested in few-shot techniques or domain adaptation for vision-language models will find a concrete starting point. The problem setup and method have enough substance to deserve peer review so the experimental details and controls can be examined closely.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the problem of few-shot visual adaptation for text-prompted segmentation and proposes PrAda, a parameter-efficient method that adapts a frozen foundational model. It claims that error analysis under domain shift reveals misclassification (rather than mask generation) as the dominant failure mode. PrAda learns class-specific prototypes by combining fine-grained pixel features with high-level transformer representations and fuses them with the original text-based predictions through a learned importance factor. This is intended to enable strong adaptation on new domains while preserving zero-shot capability. Experiments across semantic, instance, and panoptic segmentation on five benchmarks are reported to yield significant improvements over state-of-the-art and proposed baselines.

Significance. If the central claims are substantiated, the work would be significant for extending few-shot adaptation techniques from classification to text-prompted segmentation, an area noted as largely unexplored. The emphasis on parameter efficiency, retention of zero-shot performance, and evaluation across multiple segmentation paradigms and benchmarks would provide a practical contribution for deploying foundational models in specialized domains.

major comments (2)

[Abstract] Abstract: the claim that 'misclassification rather than mask generation is the main culprit' under domain shift is presented as the key motivation, yet no quantitative error breakdown, failure-mode statistics, or before/after comparison is referenced to support this premise, which is load-bearing for the rationale of prototype fusion.
[Method] The method description (implied in the abstract and method sections): the learned importance factor and class-specific prototypes (pixel features + transformer representations) are introduced as new components, but without an ablation isolating the fusion step or metrics showing specific reduction in misclassifications while preserving zero-shot mask quality on held-out classes/domains, the targeted correction mechanism remains unverified.

minor comments (1)

[Abstract] The abstract mentions 'proposed baselines' without clarifying their construction or relation to existing few-shot segmentation methods; a dedicated section or table comparing against these would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments identify important opportunities to strengthen the substantiation of our claims, and we address each point below with plans for revision.

Referee: [Abstract] Abstract: the claim that 'misclassification rather than mask generation is the main culprit' under domain shift is presented as the key motivation, yet no quantitative error breakdown, failure-mode statistics, or before/after comparison is referenced to support this premise, which is load-bearing for the rationale of prototype fusion.

Authors: We acknowledge that the submitted manuscript does not explicitly reference quantitative error statistics in the abstract or main text to support this claim. Our internal analysis did identify misclassification as the dominant failure mode under domain shift. In the revised manuscript we will add a dedicated subsection (or expanded paragraph in the introduction) presenting the quantitative error breakdown, including failure-mode statistics and before/after comparisons on held-out domains. This will directly support the motivation for prototype fusion. revision: yes
Referee: [Method] The method description (implied in the abstract and method sections): the learned importance factor and class-specific prototypes (pixel features + transformer representations) are introduced as new components, but without an ablation isolating the fusion step or metrics showing specific reduction in misclassifications while preserving zero-shot mask quality on held-out classes/domains, the targeted correction mechanism remains unverified.

Authors: We agree that an explicit ablation isolating the fusion step and the learned importance factor would strengthen verification of the correction mechanism. We have conducted such ablations, which show measurable reductions in misclassifications while preserving zero-shot mask quality on held-out classes and domains. We will add these ablation results, along with the corresponding metrics and analysis, to the revised method and experiments sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and claims rest on external empirical validation

The paper proposes PrAda as a new few-shot adaptation technique that learns class-specific prototypes from pixel features and transformer representations, then fuses them with text predictions via a learned importance factor. This construction introduces independent learned components rather than defining outputs in terms of the frozen model's inputs. Central claims of improvement are supported by experiments across semantic/instance/panoptic segmentation on five benchmarks, not by any derivation that reduces to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to force the result. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on one key domain assumption about the dominant error type under shift and introduces a small number of learned components whose values are determined during adaptation.

free parameters (1)

learned importance factor
A scalar or per-class weight learned to balance prototype-based predictions against the original text-based predictions.

axioms (1)

domain assumption: Misclassification rather than mask generation is the main culprit in performance degradation under domain shift for text-prompted segmentation models.
Stated directly in the abstract as the result of studying existing methods' errors.

invented entities (1)

class-specific prototypes (no independent evidence)
purpose: Compact representations that capture both fine-grained pixel features and high-level transformer features for each target class to enable adaptation.
New construct introduced by the method; no independent evidence outside the adaptation process itself is described.
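
How such a prototype might be scored against a predicted region can be sketched from the PrAda overview caption: features are pooled under each predicted mask, combined with the mask's query, and compared to class prototypes by cosine similarity. The caption does not say how the two parts are combined, so the concatenation below is an assumption, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def mask_pool(features, masks, eps=1e-6):
    """Average features under each mask.

    features: (C, H, W) pixel-level features.
    masks:    (N, H, W) binary or soft masks.
    returns:  (N, C) one pooled vector per mask.
    """
    weighted = np.einsum("chw,nhw->nc", features, masks)
    area = masks.sum(axis=(1, 2))[:, None]
    return weighted / (area + eps)

def visual_similarity(features, masks, queries, prototypes):
    # Assumed combination: concatenate normalized pooled features and queries.
    pooled = l2_normalize(mask_pool(features, masks))          # (N, C)
    rep = np.concatenate([pooled, l2_normalize(queries)], -1)  # (N, C + D)
    # Cosine similarity between each region and each class prototype.
    return l2_normalize(rep) @ l2_normalize(prototypes).T      # (N, K)

features = np.array([
    [[1.0, 1.0], [0.0, 0.0]],   # channel 0 is active on the top row
    [[0.0, 0.0], [1.0, 1.0]],   # channel 1 is active on the bottom row
])
masks = np.array([
    [[1.0, 1.0], [0.0, 0.0]],   # mask 0 covers the top row
    [[0.0, 0.0], [1.0, 1.0]],   # mask 1 covers the bottom row
])
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
prototypes = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])

sim = visual_similarity(features, masks, queries, prototypes)
print(np.round(sim, 3))
```

In this toy case each region matches exactly one prototype, so the similarity matrix is close to the identity.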

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 4 internal anchors

[1]

Zerowaste dataset: To- wards deformable object segmentation in cluttered scenes

Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. Zerowaste dataset: To- wards deformable object segmentation in cluttered scenes. In CVPR, 2022. 12

work page 2022
[2]

What’s the point: Semantic segmentation with point supervision

Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. InECCV, 2016. 1

work page 2016
[3]

Lan- guage models are few-shot learners.NeurIPS, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.NeurIPS, 2020. 3

work page 2020
[4]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InECCV, 2020. 3

work page 2020
[5]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In ICCV, 2021. 2, 7

work page 2021
[6]

PEM: Prototype-based efficient maskformer for image segmentation

Niccolo Cavagnero, Gabriele Rosi, Claudia Cuttano, Francesca Pistilli, Marco Ciccone, Giuseppe Averta, and Fabio Cermelli. PEM: Prototype-based efficient maskformer for image segmentation. InCVPR, 2024. 1

work page 2024
[7]

Com- former: Continual learning in semantic and panoptic seg- mentation

Fabio Cermelli, Matthieu Cord, and Arthur Douillard. Com- former: Continual learning in semantic and panoptic seg- mentation. InCVPR, 2023. 5

work page 2023
[8]

PLOT: Prompt learning with optimal transport for vision-language models

Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. PLOT: Prompt learning with optimal transport for vision-language models. InICLR,

work page
[9]

Per- pixel classification is not all you need for semantic segmen- tation.NeurIPS, 2021

Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation.NeurIPS, 2021. 1

work page 2021
[10]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InCVPR,

work page
[11]

CAT-Seg: Cost aggregation for open-vocabulary semantic segmenta- tion

Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. CAT-Seg: Cost aggregation for open-vocabulary semantic segmenta- tion. InCVPR, 2024. 2, 3

work page 2024
[12]

The Cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. InCVPR,

work page
[13]

2, 6, 7, 8, 12, 13, 14, 16, 18

work page
[14]

Cross-domain transfer learning with corte: Consistent and reliable transfer from black-box to lightweight segmentation model

Claudia Cuttano, Antonio Tavera, Fabio Cermelli, Giuseppe Averta, and Barbara Caputo. Cross-domain transfer learning with corte: Consistent and reliable transfer from black-box to lightweight segmentation model. InICCV, 2023. 1

work page 2023
[15]

SAMWISE: Infusing wis- dom in sam2 for text-driven video segmentation

Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. SAMWISE: Infusing wis- dom in sam2 for text-driven video segmentation. InCVPR,

work page
[16]

BERT: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), 2019. 3

work page 2019
[17]

De- coupling zero-shot semantic segmentation

Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De- coupling zero-shot semantic segmentation. InCVPR, 2022. 2

work page 2022
[18]

Open- vocabulary universal image segmentation with MaskCLIP

Zheng Ding, Jieke Wang, and Zhuowen Tu. Open- vocabulary universal image segmentation with MaskCLIP. Preprint arXiv:2208.08984, 2022. 2, 3

work page arXiv 2022
[19]

A new large- scale food image segmentation dataset and its application to food calorie estimation based on grains of rice

Takumi Ege, Wataru Shimoda, and Keiji Yanai. A new large- scale food image segmentation dataset and its application to food calorie estimation based on grains of rice. InPro- ceedings of the 5th international workshop on multimedia assisted dietary management, 2019. 12

work page 2019
[20]

The pascal visual object classes (voc) challenge.IJCV, 88:303–338, 2010

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.IJCV, 88:303–338, 2010. 12

work page 2010
[21]

The Pascal visual object classes challenge: A retrospective.IJCV, 111(1):98–136, 2015

Mark Everingham, SM Ali Eslami, Luc Van Gool, Christo- pher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective.IJCV, 111(1):98–136, 2015. 6

work page 2015
[22]

Rethinking few-shot adaptation of vision- language models in two stages

Matteo Farina, Massimiliano Mancini, Giovanni Iacca, and Elisa Ricci. Rethinking few-shot adaptation of vision- language models in two stages. InCVPR, 2025. 2, 3

work page 2025
[23]

CLIP-Adapter: Better vision-language models with feature adapters.IJCV, 132(2):581–595, 2024

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters.IJCV, 132(2):581–595, 2024. 2, 3, 5, 6, 7, 12

work page 2024
[24]

Scal- ing open-vocabulary image segmentation with image-level labels

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. InECCV, 2022. 2

work page 2022
[25]

Text-guided visual prompt dino for generic segmentation

Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, and Chen Li. Text-guided visual prompt dino for generic segmentation. InICCV, 2025. 2, 3, 7

work page 2025
[26]

kNN-CLIP: Retrieval enables training-free segmentation on continually expanding large vocabularies.TMLR, 2024

Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhao- chong An, Karsten Roth, Ameya Prabhu, and Philip Torr. kNN-CLIP: Retrieval enables training-free segmentation on continually expanding large vocabularies.TMLR, 2024. 2, 3, 7

work page 2024
[27]

Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation

Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. InWACV, 2025. 7

work page 2025
[28]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR,

work page
[29]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. InICML, 2019. 3 9

work page 2019
[30]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. 3

work page 2022
[31]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InICML, 2021. 2

work page 2021
[32]

T-Rex2: Towards generic object detec- tion via text-visual prompt synergy

Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-Rex2: Towards generic object detec- tion via text-visual prompt synergy. InECCV, 2024. 2, 6

work page 2024
[33]

Collaborative vision-text rep- resentation optimizing for open-vocabulary segmentation

Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yun- chao Wei, and Humphrey Shi. Collaborative vision-text rep- resentation optimizing for open-vocabulary segmentation. In ECCV, 2024. 2, 7, 13, 14

work page 2024
[34]

DINOv2 meets text: A unified framework for image-and pixel-level vision-language alignment

Cijo Jose, Th ´eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth ´ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha ¨el Ramamonjisoa, Maxime Oquab, et al. DINOv2 meets text: A unified framework for image-and pixel-level vision-language alignment. InCVPR, 2025. 2

work page 2025
[35]

Your ViT is secretly an im- age segmentation model

Tommie Kerssies, Niccolo Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. Your ViT is secretly an im- age segmentation model. InCVPR, 2025. 1

work page 2025
[36]

MaPLe: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. InCVPR, 2023. 3, 5

work page 2023
[37]

Panoptic segmentation

Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Doll ´ar. Panoptic segmentation. InCVPR,

work page
[38]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InICCV, 2023. 1, 2, 7, 8

work page 2023
[39]

ProxyCLIP: Proxy attention improves clip for open-vocabulary segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. ProxyCLIP: Proxy attention improves clip for open-vocabulary segmentation. In ECCV, 2024. 2, 7

work page 2024
[40]

Language-driven semantic seg- mentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation. InICLR, 2022. 2, 3

work page 2022
[41]

Mask DINO: To- wards a unified transformer-based framework for object de- tection and segmentation

Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask DINO: To- wards a unified transformer-based framework for object de- tection and segmentation. InCVPR, 2023. 1

work page 2023
[42]

Visual in-context prompting

Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Jianwei Yang, Chunyuan Li, et al. Visual in-context prompting. InCVPR,

work page
[43]

Multiple-Human Parsing in the Wild

Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, Terence Sim, Shuicheng Yan, and Jiashi Feng. Multiple- human parsing in the wild.arXiv preprint arXiv:1705.07206,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Op- timizing continuous prompts for generation.Preprint arXiv:2101.00190, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[45]

Learning without forgetting

Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelli- gence, 40(12):2935–2947, 2017. 5

work page 2017
[46]

Open-vocabulary semantic segmentation with mask-adapted CLIP

Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. InCVPR, 2023. 2

work page 2023
[47]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 3, 6

work page 2014
[48]

GRES: Gen- eralized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Gen- eralized referring expression segmentation. InCVPR, 2023. 1

work page 2023
[49]

A simple image segmentation framework via in-context examples.NeurIPS,

Yang Liu, Chenchen Jing, Hengtao Li, Muzhi Zhu, Hao Chen, Xinlong Wang, and Chunhua Shen. A simple image segmentation framework via in-context examples.NeurIPS,

work page
[50]

Matcher: Segment anything with one shot using all-purpose feature matching

Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, and Chunhua Shen. Matcher: Segment anything with one shot using all-purpose feature matching. InICLR, 2024. 7

work page 2024
[51]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InCVPR, 2022. 6

work page 2022
[52]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 12

work page 2019
[53]

Uavid: A semantic segmentation dataset for uav imagery.ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020

Ye Lyu, George V osselman, Gui-Song Xia, Alper Yilmaz, and Michael Ying Yang. Uavid: A semantic segmentation dataset for uav imagery.ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020. 12

work page 2020
[54]

Simple open-vocabulary object detection

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In ECCV, 2022. 2

[55] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, 2017.

[56] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Preprint arXiv:2304.07193, 2023.

[57] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[58] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. In ICLR, 2025.

[59] Gabriele Rosi and Fabio Cermelli. Show or tell? a benchmark to evaluate visual and textual prompts in semantic segmentation. In CVPR, 2025.

[60] Gabriele Rosi, Claudia Cuttano, Niccolò Cavagnero, Giuseppe Averta, and Fabio Cermelli. The revenge of bisenet: Efficient multi-task image segmentation. In CVPR,

[61] Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, and Rongrong Ji. Aligning and prompting everything all at once for universal visual perception. In CVPR, 2024.

[62] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. Preprint arXiv:2508.10104, 2025.

[63] Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation. In NeurIPS,

[64] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023.

[65] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. SegGPT: Segmenting everything in context. In ICCV, 2023.

[66] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In CVPR, 2022.

[67] Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, and Oriane Siméoni. CLIP-DIY: Clip dense inference yields open-vocabulary semantic segmentation for-free. In WACV, 2024.

[68] Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Pérez. CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. In ECCV, 2024.

[69] Jialu Xing, Jianping Liu, Jian Wang, Lulu Sun, Xi Chen, Xunxun Gu, and Yingfei Wang. A survey of efficient fine-tuning methods for vision-language models—prompt and adapter. Computers & Graphics, 119:103885, 2024.

[70] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, 2023.

[71] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In ECCV, 2022.

[72] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In CVPR, 2023.

[73] Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, and Changsheng Xu. Multi-modal queried object detection in the wild. NeurIPS, 2023.

[74] Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, and Xiaohua Xie. MMA: Multi-modal adapter for vision-language models. In CVPR, 2024.

[75] Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. NeurIPS,

[77] Tao Yu, Zhihe Lu, Xin Jin, Zhibo Chen, and Xinchao Wang. Task residual for tuning vision-language models. In CVPR,

[78] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In ECCV, 2022.

[79] Anqi Zhang, Guangyu Gao, Jianbo Jiao, Chi Liu, and Yunchao Wei. Bridge the points: Graph-based few-shot segment anything semantically. NeurIPS, 2024.

[80] Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection. In ICCV,

