arxiv: 2604.04444 · v1 · submitted 2026-04-06 · 💻 cs.CV

Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection

Weihao Cao, Runqi Wang, Xiaoyue Duan, Jinchao Zhang, Ang Yang, Liping Jing

Pith reviewed 2026-05-10 18:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary object detection · semantic augmentation · parameter-efficient adaptation · domain shift · multi-scale prompts · vision-language models · prompt bank · dynamic router


The pith

HSA-DINO augments open-vocabulary detectors with multi-scale semantic prompts and a dynamic router to close performance gaps on domain-shifted tasks without losing broad generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Open-vocabulary object detection models trained on large general datasets perform well in familiar settings but drop sharply when moved to specialized domains where category labels are scarce or semantically weak. The paper proposes HSA-DINO, which adds richer auxiliary semantics to the text side of these models through a bank of prompts selected with the help of image feature pyramids at multiple scales, enriching textual representations progressively from coarse to fine. A semantic-aware router then chooses which augmentation strategy to apply during inference, so that the parameter updates made for adaptation do not degrade the pre-trained detector's generalization. This targets the core mismatch between pre-training data and downstream domain needs while protecting the model's ability to recognize categories never seen in training. Readers would care because successful adaptation would make existing large-scale detectors more readily usable for narrow real-world applications such as medical or industrial scenes.

Core claim

HSA-DINO is a parameter-efficient semantic augmentation framework for open-vocabulary object detection. It introduces a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. A semantic-aware router dynamically selects the appropriate semantic augmentation strategy at inference time, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. Experiments on OV-COCO, several vertical domain datasets, and modified benchmarks show improved performance over prior state-of-the-art methods.

What carries the argument

The multi-scale prompt bank together with the semantic-aware router: the bank extracts hierarchical semantics from image pyramids to form domain-specific prompts, while the router selects which prompts to apply dynamically so that textual features are enriched without retraining the detector.
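To make the mechanism concrete, here is a minimal sketch of key-query prompt selection over a multi-scale bank. It assumes cosine-similarity matching between a pooled image feature and a learnable key per bank entry, in the spirit of the prompt-pool methods the paper cites [30, 31]. The function names `select_prompts` and `enrich_text`, the top-k rule, and the choice to prepend prompts to the label tokens are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_prompts(query, keys, prompts, k=1):
    """Return indices and tokens of the k bank entries whose keys best match the query.

    query:   pooled image feature at one pyramid level (assumed), length D
    keys:    N key vectors, one per bank entry
    prompts: N entries, each a list of M prompt tokens of length D
    """
    ranked = sorted(range(len(keys)), key=lambda i: cosine(query, keys[i]), reverse=True)
    top = ranked[:k]
    tokens = [tok for i in top for tok in prompts[i]]
    return top, tokens

def enrich_text(label_tokens, level_queries, banks, k=1):
    """Prepend the selected prompts level by level, from coarse to fine."""
    tokens = list(label_tokens)
    for query, (keys, prompts) in zip(level_queries, banks):
        _, chosen = select_prompts(query, keys, prompts, k)
        tokens = chosen + tokens
    return tokens

# Toy bank: N = 2 entries per level, M = 2 tokens per entry, D = 2.
keys = [[1.0, 0.0], [0.0, 1.0]]
prompts = [[[1.0, 1.0], [2.0, 2.0]], [[3.0, 3.0], [4.0, 4.0]]]
banks = [(keys, prompts), (keys, prompts)]   # coarse level, fine level
level_queries = [[0.9, 0.1], [0.2, 0.8]]     # hypothetical pooled features
label_tokens = [[0.5, 0.5]]                  # hypothetical label embedding

enriched = enrich_text(label_tokens, level_queries, banks, k=1)
print(len(enriched))  # 5 tokens: 2 per level plus the original label token
```

In this toy setup the text sequence grows by k × M tokens per pyramid level, which is one reason bank size N and prompt length M are natural ablation knobs.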

If this is right

Detection accuracy rises on vertical domain datasets that exhibit large visual shifts from the original pre-training distribution.
The model maintains or improves its ability to detect arbitrary unseen categories even after the semantic additions are applied.
No parameter updates to the core detector are required, so the same weights can be reused across many different downstream domains.
The approach yields a better measured trade-off than earlier methods between domain-specific gains and retention of open-vocabulary capability.
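The parameter-efficiency point can be illustrated with LoRA [9], which the Figure 3 caption says is incorporated into the image encoder during training. The sketch below shows the standard low-rank update on a toy layer: the pre-trained weight stays frozen and only two small factors would be trained. The matrix sizes, the values, and the helper names `matvec` and `lora_forward` are illustrative assumptions; nothing here reflects the rank or layer placement the authors actually use.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    r = len(A)                       # rank = number of rows of A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]         # frozen pre-trained weight (toy 2x2)
A = [[1.0, 1.0]]                     # r x d_in with r = 1
B_zero = [[0.0], [0.0]]              # d_out x r, zero init: adapter is a no-op
B_trained = [[0.5], [-0.5]]          # hypothetical values after training

x = [2.0, 4.0]
print(lora_forward(W, A, B_zero, x))     # [2.0, 4.0]
print(lora_forward(W, A, B_trained, x))  # [5.0, 1.0]
```

Because the zero-initialized adapter reproduces the frozen layer exactly, dropping or bypassing the low-rank factors recovers the original detector, which is what makes reuse of the same base weights across domains plausible.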

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt-based augmentation at inference time may serve as a lightweight substitute for domain-specific fine-tuning across other vision-language tasks.
The layered coarse-to-fine enrichment implies that semantic detail should be supplied incrementally rather than in a single step for optimal stability.
The router's selection logic could be inspected after deployment to identify which semantic scales matter most for particular kinds of domain shift.

Load-bearing premise

That the multi-scale prompt bank can reliably extract useful domain-specific local semantics and the router can select them without introducing noise that harms the model's original open-vocabulary recognition of unseen categories.
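One way to picture the router's gatekeeping role is as an out-of-domain test. The Figure 7 caption suggests that reconstruction errors are involved, but the text available here does not say how. The sketch below therefore assumes a rank-1 linear reconstructor per domain, a fixed threshold, and a fallback to the unmodified pre-trained path when no domain reconstructs the input well. All names (`reconstruction_error`, `route`, the domain labels) are hypothetical.

```python
def reconstruction_error(x, basis):
    """Squared residual after projecting x onto a unit-norm direction `basis`."""
    coeff = sum(a * b for a, b in zip(x, basis))
    return sum((a - coeff * b) ** 2 for a, b in zip(x, basis))

def route(x, domain_bases, threshold):
    """Return the best-matching domain name, or None to use the pre-trained path."""
    errors = {name: reconstruction_error(x, b) for name, b in domain_bases.items()}
    best = min(errors, key=errors.get)
    return best if errors[best] <= threshold else None

# Hypothetical unit-norm directions standing in for learned per-domain reconstructors.
domain_bases = {
    "aerial": [1.0, 0.0, 0.0],
    "underwater": [0.0, 1.0, 0.0],
}

print(route([2.0, 0.1, 0.0], domain_bases, threshold=0.05))  # aerial
print(route([0.0, 0.0, 3.0], domain_bases, threshold=0.05))  # None
```

Under this reading, the premise amounts to the threshold separating in-domain from general inputs cleanly; a threshold that is too loose would send general images through domain prompts and risk the noise the premise worries about.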

What would settle it

Apply the full HSA-DINO system and the baseline detector to a held-out domain dataset containing many novel categories; if mean average precision on the novel categories falls below the baseline while domain-specific gains appear only on seen classes, the claim that generalization is preserved would be refuted.
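The test described above can be written down as a small decision rule. This is a sketch under stated assumptions: per-class AP values are taken as given (computing them is out of scope), the class names and numbers are placeholders rather than results from the paper, and `mean_ap` and `settle` are hypothetical helper names.

```python
def mean_ap(per_class_ap, classes):
    """Mean of per-class AP over the given class subset."""
    return sum(per_class_ap[c] for c in classes) / len(classes)

def settle(baseline_ap, adapted_ap, seen, novel, tol=0.0):
    """Flag the generalization claim as refuted when novel-class mAP drops
    below the baseline (beyond `tol`) while seen-class mAP improves."""
    seen_gain = mean_ap(adapted_ap, seen) - mean_ap(baseline_ap, seen)
    novel_gain = mean_ap(adapted_ap, novel) - mean_ap(baseline_ap, novel)
    refuted = novel_gain < -tol and seen_gain > 0
    return {"seen_gain": seen_gain, "novel_gain": novel_gain, "refuted": refuted}

# Placeholder per-class AP values, for illustration only.
baseline = {"a": 40.0, "b": 50.0, "x": 30.0, "y": 20.0}
adapted = {"a": 50.0, "b": 60.0, "x": 25.0, "y": 15.0}

result = settle(baseline, adapted, seen=["a", "b"], novel=["x", "y"])
print(result)  # {'seen_gain': 10.0, 'novel_gain': -5.0, 'refuted': True}
```

A nonzero `tol` would let the check ignore drops within run-to-run noise, which matters if the novel-class set is small.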

Figures

Figures reproduced from arXiv: 2604.04444 by Ang Yang, Jinchao Zhang, Liping Jing, Runqi Wang, Weihao Cao, Xiaoyue Duan.

**Figure 2.** Method motivation. (a) Previous methods use predefined templates or learnable vectors prepended to the category label embed ...

**Figure 3.** Overview of the proposed HSA-DINO framework. We incorporate LoRA into the image encoder during training on downstream ...

**Figure 4.** Ablation study on (a) bank size N, (b) prompt length M, (c) key matching loss weight λ_m, and (d) orthogonal loss weight λ_p.

| Method | H_ArTaxOr | H_DIOR | H_UODD | H_mean |
|---|---|---|---|---|
| Predefined + SAR | 54.6 | 47.3 | 47.7 | 49.9 |
| CoOp [39] + SAR | 57.1 | 51.1 | 48.0 | 52.1 |
| AttriCLIP [30] + SAR | 58.8 | 51.6 | 48.5 | 53.0 |
| MSPB + SAR (ours) | 60.5 | 53.0 | 49.6 | 54.4 |

**Figure 6.** Visualization of the selected prompts at different textual ...

**Figure 7.** The probability distribution of reconstruction errors from ...
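As a consistency check on the comparison values listed with Figure 4, the short script below recomputes the final column from the three per-dataset columns. It shows that the reported Hmean values match the arithmetic mean of the ArTaxOr, DIOR, and UODD entries after rounding to one decimal. What the per-dataset H score itself measures is not defined in the text available here, so the script makes no assumption about it; the helper name `arithmetic_mean` is ours.

```python
# Values copied from the comparison listed with Figure 4:
# (H on ArTaxOr, H on DIOR, H on UODD, reported Hmean)
rows = {
    "Predefined + SAR": (54.6, 47.3, 47.7, 49.9),
    "CoOp [39] + SAR": (57.1, 51.1, 48.0, 52.1),
    "AttriCLIP [30] + SAR": (58.8, 51.6, 48.5, 53.0),
    "MSPB + SAR (ours)": (60.5, 53.0, 49.6, 54.4),
}

def arithmetic_mean(values):
    """Plain average of a sequence of numbers."""
    return sum(values) / len(values)

for method, (*per_dataset, reported) in rows.items():
    computed = round(arithmetic_mean(per_dataset), 1)
    print(f"{method}: computed {computed}, reported {reported}")
```

On these numbers the prompt-bank variant leads the next-best row by 1.4 points in the averaged column, with the largest per-dataset margin on ArTaxOr (1.7).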

Original abstract

Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific tasks, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category labels. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical domain datasets, and modified benchmark settings. The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HSA-DINO adds a multi-scale prompt bank and a semantic-aware router to adapt DINO-style OVOD models to domain shifts with few additional parameters, but the router's contribution to preserving generalization is not isolated well enough to fully support the trade-off claim.


The main point is that this paper gives a concrete, parameter-efficient way to adapt pre-trained OVOD detectors to vertical domains by enriching text prompts hierarchically and routing them dynamically. The HSA-DINO setup uses image feature pyramids to build a multi-scale prompt bank that adds coarse-to-fine semantics, then a router picks the right augmentation at inference time so the frozen backbone does not lose its open-vocabulary strength. That combination is the actual new piece; prior prompt-tuning work in OVOD did not pair multi-scale selection with this kind of inference-time router for domain-specific labels. The motivation is clear and the engineering is straightforward, which is useful when you cannot afford to retrain the whole model on scarce domain data.

The experiments reportedly compare against prior SOTA on OV-COCO plus several vertical sets and claim a better adaptability-versus-generalization balance. That is the part worth looking at if you work on deployment. The soft spot is exactly where the stress-test note flags it: the router is load-bearing for the no-degradation claim, yet there is no reported routing accuracy, no ablation that replaces the learned router with random selection, and no direct head-to-head of OV-COCO AP with and without the router on the same backbone. Without those, it is hard to know whether gains come from the prompts themselves or from the router avoiding bad choices.

The paper is aimed at CV researchers who adapt large detectors to narrow domains such as industrial inspection or medical imaging. Readers who need a drop-in tweak to DINO-style models will get practical ideas here. It is worth sending to peer review because the core problem is real and the method is reproducible enough to test, even if the router validation needs tightening.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes HSA-DINO, a parameter-efficient semantic augmentation framework for open-vocabulary object detection. It introduces a multi-scale prompt bank leveraging image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, plus a semantic-aware router that dynamically chooses augmentation strategies at inference to avoid degrading the frozen pre-trained OVOD model's generalization. The authors claim this yields a superior trade-off between domain adaptability and open-vocabulary performance on OV-COCO and vertical-domain benchmarks.

Significance. If the results and ablations hold, the approach offers a practical, low-parameter method for adapting pre-trained OVOD models to domain-shifted tasks without full retraining or loss of broad generalization, which addresses a common deployment limitation in the field.

major comments (2)

The headline claim of a superior domain-adaptability vs. generalization trade-off rests on the semantic-aware router correctly choosing prompts from the multi-scale bank at inference time without eroding OV-COCO performance. No routing accuracy metric, no ablation isolating the router (learned vs. random selection), and no direct OV-COCO AP comparison with vs. without the router are supplied to validate this.
The abstract asserts superior performance and a favorable trade-off but supplies no quantitative numbers, ablation results, or experimental protocol, making it impossible to judge whether the data actually support the stated claims about the router and prompt bank.
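The two router checks requested in the first comment are cheap to specify. The sketch below computes a routing accuracy against ground-truth domain labels and builds a seeded random-selection baseline to compare against. The label names, the example predictions, and the helpers `routing_accuracy` and `random_router` are hypothetical; the paper's actual routing targets are not described in the text available here.

```python
import random

def routing_accuracy(predicted, truth):
    """Fraction of inputs for which the router picked the ground-truth domain."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def random_router(n, choices, seed=0):
    """Baseline that ignores the input and picks a strategy uniformly at random."""
    rng = random.Random(seed)
    return [rng.choice(choices) for _ in range(n)]

# Hypothetical ground-truth domains and learned-router outputs for four images.
truth = ["general", "aerial", "aerial", "underwater"]
learned = ["general", "aerial", "underwater", "underwater"]
choices = ["general", "aerial", "underwater"]

print(routing_accuracy(learned, truth))  # 0.75
baseline = random_router(len(truth), choices, seed=0)
print(routing_accuracy(baseline, truth))  # chance level is 1/3 in expectation
```

Feeding both sets of routing decisions into the same detector and reporting OV-COCO AP for each would give the with/without comparison the referee asks for.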

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.


Referee: The headline claim of a superior domain-adaptability vs. generalization trade-off rests on the semantic-aware router correctly choosing prompts from the multi-scale bank at inference time without eroding OV-COCO performance. No routing accuracy metric, no ablation isolating the router (learned vs. random selection), and no direct OV-COCO AP comparison with vs. without the router are supplied to validate this.

Authors: We agree that dedicated validation of the semantic-aware router would strengthen the evidence for our headline claim. The current manuscript reports overall performance gains on OV-COCO and domain-shifted benchmarks but does not include a routing accuracy metric, a learned-versus-random router ablation, or an explicit with/without-router OV-COCO AP comparison. In the revised manuscript we will add all three: (1) a routing accuracy metric, (2) an ablation isolating learned router selection against random selection, and (3) a direct OV-COCO AP comparison with and without the router. These additions will more clearly isolate the router’s contribution to preserving generalization. revision: yes
Referee: The abstract asserts superior performance and a favorable trade-off but supplies no quantitative numbers, ablation results, or experimental protocol, making it impossible to judge whether the data actually support the stated claims about the router and prompt bank.

Authors: We acknowledge that the abstract would benefit from concrete quantitative support. We will revise the abstract to include key performance numbers (e.g., AP improvements on OV-COCO and vertical-domain benchmarks) and a brief reference to the ablation studies that underpin the claims regarding the multi-scale prompt bank and semantic-aware router. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks are independent of internal definitions


The paper proposes HSA-DINO with a multi-scale prompt bank and semantic-aware router, then reports measured performance gains on OV-COCO and vertical-domain datasets against prior methods. No equations, parameter fits presented as predictions, or self-citation chains appear in the text. All claims reduce to direct experimental comparisons on held-out data rather than any definitional or fitted reduction to the method's own components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone contains no explicit free parameters, axioms, or invented entities that can be extracted.

pith-pipeline@v0.9.0 · 5533 in / 1009 out tokens · 49200 ms · 2026-05-10T18:51:28.302048+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: multi-scale prompt bank ... semantic-aware router that dynamically selects ... reconstruction error

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: HSA-DINO ... parameter-efficient semantic augmentation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024.
[2] Achal Dave, Piotr Dollár, Deva Ramanan, Alexander Kirillov, and Ross Girshick. Evaluating large-vocabulary object detectors: The devil is in the details. arXiv preprint, 2021.
[3] Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, and Yunkuan Wang. Zero-shot generalizable incremental learning for vision-language object detection. Advances in Neural Information Processing Systems, 37:136679–136700, 2024.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.
[5] Bowen Dong, Zitong Huang, Guanglei Yang, Lei Zhang, and Wangmeng Zuo. Mr-gdino: efficient open-world continual object detection. arXiv preprint, 2024.
[6] Geir Drange. Arthropod taxonomy orders object detection dataset, 2019.
[7] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint, 2021.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. International Conference on Learning Representations, 1(2):3, 2022.
[10] Lihao Jiang, Yi Wang, Qi Jia, Shengwei Xu, Yu Liu, Xin Fan, Haojie Li, Risheng Liu, Xinwei Xue, and Ruili Wang. Underwater species detection using channel sharpening attention. In Proceedings of the ACM International Conference on Multimedia, pages 4259–4267, 2021.
[11] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021.
[12] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022.
[13] Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020.
[14] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2980–2988, 2017.
[17] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,
[18] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[19] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint, 2017.
[20] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2641–2649, 2015.
[21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[22] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666,
[23] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[24] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint, 2019.
[25] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF international conference on computer vision, 2017.
[26] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.
[27] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2556–2565, 2018.
[28] Charles F Van Loan. The ubiquitous kronecker product. Journal of computational and applied mathematics, 123(1-2):85–100, 2000.
[29] Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, et al. Ov-dino: Unified open-vocabulary detection with language-aware selective fusion. arXiv preprint,
[30] Runqi Wang, Xiaoyue Duan, Guoliang Kang, Jianzhuang Liu, Shaohui Lin, Songcen Xu, Jinhu Lü, and Baochang Zhang. Attriclip: A non-incremental learner for incremental knowledge learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3654–3663, 2023.
[31] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149,
[32] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4582–4591, 2017.
[33] Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024.
[34] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In European conference on computer vision, pages 106–122. Springer, 2022.
[35] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint, 2022.
[36] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In European conference on computer vision, pages 493–510. Springer, 2022.
[37] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022.
[38] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825,
[39] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348,