pith. the verified trust layer for science.

arxiv: 2604.04444 · v1 · submitted 2026-04-06 · 💻 cs.CV

Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection

Pith reviewed 2026-05-10 18:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary object detection · semantic augmentation · parameter-efficient adaptation · domain shift · multi-scale prompts · vision-language models · prompt bank · dynamic router

The pith

HSA-DINO augments open-vocabulary detectors with multi-scale semantic prompts and a dynamic router to close performance gaps on domain-shifted tasks without losing broad generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Open-vocabulary object detection models trained on large general datasets perform well in familiar settings but drop sharply when moved to specialized domains where category labels are scarce or semantically weak. The paper proposes HSA-DINO to add richer auxiliary semantics to the text side of these models through a bank of prompts selected with the help of image feature pyramids at multiple scales, enriching textual representations progressively from coarse to fine. A semantic-aware router then chooses which augmentation strategy to apply during inference, so that the parameter-efficient updates made for a target domain do not degrade the pre-trained detector's generalization. This targets the core mismatch between pre-training data and downstream domain needs while protecting the model's ability to recognize categories never seen in training. Readers would care because successful adaptation would make existing large-scale detectors more readily usable in narrow real-world applications such as medical or industrial scenes.

Core claim

HSA-DINO is a parameter-efficient semantic augmentation framework for open-vocabulary object detection. It introduces a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. A semantic-aware router dynamically selects the appropriate semantic augmentation strategy at inference time, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. Experiments on OV-COCO, several vertical domain datasets, and modified benchmarks show improved performance over prior state-of-the-art methods.
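As a rough illustration of how a key-matched prompt bank of this kind could work, the sketch below pools each pyramid level into a query, compares it against learnable keys, and gathers the best-matching prompts level by level. The Figure 4 caption confirms a bank size N, a prompt length M, a key matching loss, and an orthogonal loss; everything else here (the names `select_prompts` and `orthogonality_penalty`, mean pooling, cosine similarity, top-k selection) is an assumption made for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def select_prompts(pyramid_feats, keys, prompts, top_k=2):
    """Pick, for every pyramid level, the top_k bank prompts whose keys match best.

    pyramid_feats: list of (H_l, W_l, D) arrays ordered coarse to fine (assumed layout)
    keys:          (N, D) array, one key per bank entry
    prompts:       (N, M, D) array, M prompt tokens per bank entry
    Returns per-level prompt tokens of shape (top_k * M, D), the chosen indices,
    and a key matching loss (mean cosine distance between queries and chosen keys).
    """
    keys_n = l2_normalize(keys)
    selected, chosen, distances = [], [], []
    for feat in pyramid_feats:
        # Mean-pool the spatial grid into a single query vector (an assumption).
        query = l2_normalize(feat.reshape(-1, feat.shape[-1]).mean(axis=0))
        sims = keys_n @ query                    # cosine similarity to each key
        idx = np.argsort(-sims)[:top_k]          # best-matching bank entries
        selected.append(prompts[idx].reshape(-1, prompts.shape[-1]))
        chosen.append(idx)
        distances.append(1.0 - sims[idx])
    match_loss = float(np.mean(np.concatenate(distances)))
    return selected, chosen, match_loss

def orthogonality_penalty(prompts):
    """Mean squared cosine similarity between different bank entries."""
    flat = l2_normalize(prompts.reshape(prompts.shape[0], -1))
    gram = flat @ flat.T
    np.fill_diagonal(gram, 0.0)                  # ignore self-similarity
    n = flat.shape[0]
    return float((gram ** 2).sum() / (n * (n - 1)))
```

In a training loop these two scalars would plausibly be weighted by λm and λp and added to the detection loss; that combination is also an assumption, since the page does not state the full objective.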

What carries the argument

The multi-scale prompt bank together with the semantic-aware router: the bank extracts hierarchical semantics from image pyramids to form domain-specific prompts, while the router selects which prompts to apply dynamically so that textual features are enriched without retraining the detector.
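The Figure 7 caption refers to a distribution of reconstruction errors, which hints that routing may rest on how well a domain-specific model reconstructs the input features. A minimal sketch under that assumption: one small linear autoencoder per downstream domain, with the router picking the domain whose error is lowest relative to its own threshold and otherwise falling back to the unaugmented pre-trained path. The `LinearAutoencoder` class, the mean-plus-three-standard-deviations threshold, and the fallback rule are hypothetical and not taken from the paper.

```python
import numpy as np

class LinearAutoencoder:
    """PCA-style autoencoder: project onto the top-`rank` directions and back."""

    def __init__(self, rank):
        self.rank = rank

    def fit(self, feats):
        """Fit on (num_samples, D) features from one downstream domain."""
        self.mean = feats.mean(axis=0)
        _, _, vt = np.linalg.svd(feats - self.mean, full_matrices=False)
        self.basis = vt[: self.rank]                      # (rank, D)
        errs = self.error(feats)
        # Hypothetical acceptance threshold derived from in-domain errors.
        self.threshold = errs.mean() + 3.0 * errs.std()
        return self

    def error(self, feats):
        """Mean squared reconstruction error per sample."""
        centered = np.atleast_2d(feats) - self.mean
        recon = centered @ self.basis.T @ self.basis
        return ((centered - recon) ** 2).mean(axis=1)

def route(feat, experts):
    """Return the best-matching domain name, or None to use the pre-trained path.

    experts: dict mapping a domain name to a fitted LinearAutoencoder.
    A domain is eligible only if its error falls below its own threshold.
    """
    best_name, best_ratio = None, 1.0
    for name, autoencoder in experts.items():
        ratio = autoencoder.error(feat)[0] / autoencoder.threshold
        if ratio < best_ratio:
            best_name, best_ratio = name, ratio
    return best_name
```

Returning `None` for inputs that no domain reconstructs well is what would keep general-domain images on the original detector in this sketch.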

If this is right

  • Detection accuracy rises on vertical domain datasets that exhibit large visual shifts from the original pre-training distribution.
  • The model maintains or improves its ability to detect arbitrary unseen categories even after the semantic additions are applied.
  • Adaptation touches only a small number of added parameters, so the pre-trained detector weights can be shared across many different downstream domains.
  • The approach yields a better measured trade-off than earlier methods between domain-specific gains and retention of open-vocabulary capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt-based augmentation at inference time may serve as a lightweight substitute for domain-specific fine-tuning across other vision-language tasks.
  • The layered coarse-to-fine enrichment implies that semantic detail should be supplied incrementally rather than in a single step for optimal stability.
  • The router's selection logic could be inspected after deployment to identify which semantic scales matter most for particular kinds of domain shift.
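One way such an inspection could be set up, assuming the deployed system logs which bank entry was selected at each pyramid level. The log format and the function names below are hypothetical.

```python
from collections import Counter

def selection_histogram(selection_log):
    """Count how often each prompt index was chosen at each pyramid level.

    selection_log: iterable of (level, prompt_index) pairs recorded at inference.
    Returns a dict mapping level -> Counter of prompt indices.
    """
    per_level = {}
    for level, prompt_index in selection_log:
        per_level.setdefault(level, Counter())[prompt_index] += 1
    return per_level

def dominant_prompts(per_level, top=1):
    """List the `top` most frequently selected prompt indices for every level."""
    return {
        level: [index for index, _ in counts.most_common(top)]
        for level, counts in per_level.items()
    }
```

Comparing these tallies across domains would show whether coarse or fine levels carry most of the selected semantics for a given kind of shift.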

Load-bearing premise

That the multi-scale prompt bank can reliably extract useful domain-specific local semantics and the router can select them without introducing noise that harms the model's original open-vocabulary recognition of unseen categories.

What would settle it

Apply the full HSA-DINO system and the baseline detector to a held-out domain dataset containing many novel categories; if mean average precision on the novel categories falls below the baseline while domain-specific gains appear only on seen classes, the claim that generalization is preserved would be refuted.
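The refutation test described above reduces to a small decision rule. The sketch below encodes it, with `APReport`, `verdict`, and the `tolerance` parameter as illustrative names rather than anything defined in the paper.

```python
from dataclasses import dataclass

@dataclass
class APReport:
    novel_ap: float  # mean AP over categories not seen during adaptation
    seen_ap: float   # mean AP over categories seen during adaptation

def verdict(baseline, adapted, tolerance=0.0):
    """Classify the outcome of the held-out-domain comparison.

    "refuted":      novel-category AP fell below the baseline while seen classes gained
    "supported":    novel-category AP held up and seen classes gained
    "inconclusive": any other combination
    """
    kept_novel = adapted.novel_ap >= baseline.novel_ap - tolerance
    gained_seen = adapted.seen_ap > baseline.seen_ap
    if gained_seen and kept_novel:
        return "supported"
    if gained_seen and not kept_novel:
        return "refuted"
    return "inconclusive"
```

A nonzero `tolerance` would let small run-to-run variation in AP pass without counting as a loss of generalization; how large it should be depends on the evaluation protocol.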

Figures

Figures reproduced from arXiv: 2604.04444 by Ang Yang, Jinchao Zhang, Liping Jing, Runqi Wang, Weihao Cao, Xiaoyue Duan.

Figure 1: Pre-trained OVOD models perform well on general do…
Figure 2: Method motivation. (a) Previous methods use predefined templates or learnable vectors prepended to the category label embed…
Figure 3: Overview of the proposed HSA-DINO framework. We incorporate LoRA into the image encoder during training on downstream…
Figure 4: Ablation study on (a) bank size N, (b) prompt length M, (c) key matching loss weight λm, and (d) orthogonal loss weight λp.

Method                HArTaxOr  HDIOR  HUODD  Hmean
Predefined + SAR      54.6      47.3   47.7   49.9
CoOp [39] + SAR       57.1      51.1   48.0   52.1
AttriCLIP [30] + SAR  58.8      51.6   48.5   53.0
MSPB + SAR (ours)     60.5      53.0   49.6   54.4

Figure 6: Visualization of the selected prompts at different textual…
Figure 7: The probability distribution of reconstruction errors from…
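The Hmean values in the table text attached to the Figure 4 caption are consistent with a plain arithmetic mean of the three per-dataset columns. A quick check, using only the numbers reproduced on this page (the variable and function names are illustrative):

```python
# Rows copied from the table text: (HArTaxOr, HDIOR, HUODD, reported Hmean).
rows = {
    "Predefined + SAR": (54.6, 47.3, 47.7, 49.9),
    "CoOp [39] + SAR": (57.1, 51.1, 48.0, 52.1),
    "AttriCLIP [30] + SAR": (58.8, 51.6, 48.5, 53.0),
    "MSPB + SAR (ours)": (60.5, 53.0, 49.6, 54.4),
}

def arithmetic_mean(values):
    """Plain average of a sequence of numbers."""
    return sum(values) / len(values)

# Recompute the mean column to one decimal place and compare with the table.
recomputed = {name: round(arithmetic_mean(vals[:3]), 1) for name, vals in rows.items()}
mismatches = [name for name, vals in rows.items() if abs(recomputed[name] - vals[3]) > 1e-9]
```

With these inputs `mismatches` stays empty, so the reported means agree with the per-dataset scores to the stated precision.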
Original abstract

Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific tasks, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category labels. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical domain datasets, and modified benchmark settings. The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes HSA-DINO, a parameter-efficient semantic augmentation framework for open-vocabulary object detection. It introduces a multi-scale prompt bank leveraging image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, plus a semantic-aware router that dynamically chooses augmentation strategies at inference to avoid degrading the frozen pre-trained OVOD model's generalization. The authors claim this yields a superior trade-off between domain adaptability and open-vocabulary performance on OV-COCO and vertical-domain benchmarks.

Significance. If the results and ablations hold, the approach offers a practical, low-parameter method for adapting pre-trained OVOD models to domain-shifted tasks without full retraining or loss of broad generalization, which addresses a common deployment limitation in the field.

major comments (2)
  1. The headline claim of a superior domain-adaptability vs. generalization trade-off rests on the semantic-aware router correctly choosing prompts from the multi-scale bank at inference time without eroding OV-COCO performance. No routing accuracy metric, no ablation isolating the router (learned vs. random selection), and no direct OV-COCO AP comparison with vs. without the router are supplied to validate this.
  2. The abstract asserts superior performance and a favorable trade-off but supplies no quantitative numbers, ablation results, or experimental protocol, making it impossible to judge whether the data actually support the stated claims about the router and prompt bank.
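The first two checks requested in comment 1 are cheap to compute once router decisions are logged. A minimal sketch, assuming each test image carries a ground-truth route label (for example its source domain) and the router emits an integer route id; `routing_accuracy` and `random_router_accuracy` are illustrative names, not functions from the paper's code.

```python
import numpy as np

def routing_accuracy(predicted, expected):
    """Fraction of inputs for which the router chose the expected route."""
    predicted = np.asarray(predicted)
    expected = np.asarray(expected)
    return float((predicted == expected).mean())

def random_router_accuracy(expected, num_routes, trials=1000, seed=0):
    """Accuracy of a uniform random router, averaged over repeated trials."""
    rng = np.random.default_rng(seed)
    expected = np.asarray(expected)
    # One row of random route ids per trial; `high` is exclusive.
    draws = rng.integers(0, num_routes, size=(trials, expected.size))
    return float((draws == expected).mean())
```

The gap between the two numbers gives the learned-versus-random comparison; the third requested check, OV-COCO AP with and without the router, needs full detector runs and is not covered here.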

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: The headline claim of a superior domain-adaptability vs. generalization trade-off rests on the semantic-aware router correctly choosing prompts from the multi-scale bank at inference time without eroding OV-COCO performance. No routing accuracy metric, no ablation isolating the router (learned vs. random selection), and no direct OV-COCO AP comparison with vs. without the router are supplied to validate this.

    Authors: We agree that dedicated validation of the semantic-aware router would strengthen the evidence for our headline claim. The current manuscript reports overall performance gains on OV-COCO and domain-shifted benchmarks but does not include a routing accuracy metric, a learned-versus-random router ablation, or an explicit with/without-router OV-COCO AP comparison. In the revised manuscript we will add all three: (1) a routing accuracy metric, (2) an ablation isolating learned router selection against random selection, and (3) a direct OV-COCO AP comparison with and without the router. These additions will more clearly isolate the router’s contribution to preserving generalization.

    revision: yes

  2. Referee: The abstract asserts superior performance and a favorable trade-off but supplies no quantitative numbers, ablation results, or experimental protocol, making it impossible to judge whether the data actually support the stated claims about the router and prompt bank.

    Authors: We acknowledge that the abstract would benefit from concrete quantitative support. We will revise the abstract to include key performance numbers (e.g., AP improvements on OV-COCO and vertical-domain benchmarks) and a brief reference to the ablation studies that underpin the claims regarding the multi-scale prompt bank and semantic-aware router.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks are independent of internal definitions

full rationale

The paper proposes HSA-DINO with a multi-scale prompt bank and semantic-aware router, then reports measured performance gains on OV-COCO and vertical-domain datasets against prior methods. No equations, parameter fits presented as predictions, or self-citation chains appear in the text. All claims reduce to direct experimental comparisons on held-out data rather than any definitional or fitted reduction to the method's own components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone contains no explicit free parameters, axioms, or invented entities that can be extracted.

pith-pipeline@v0.9.0 · 5533 in / 1009 out tokens · 49200 ms · 2026-05-10T18:51:28.302048+00:00 · methodology



Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

  1. [1]

    Yolo-world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024.

  2. [2]

    Evaluating large-vocabulary object detectors: The devil is in the details

    Achal Dave, Piotr Dollár, Deva Ramanan, Alexander Kirillov, and Ross Girshick. Evaluating large-vocabulary object detectors: The devil is in the details. arXiv preprint, 2021.

  3. [3]

    Zero-shot generalizable incremental learning for vision-language object detection

    Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, and Yunkuan Wang. Zero-shot generalizable incremental learning for vision-language object detection. Advances in Neural Information Processing Systems, 37:136679–136700, 2024.

  4. [4]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.

  5. [5]

    Mr-gdino: efficient open-world continual object detection

    Bowen Dong, Zitong Huang, Guanglei Yang, Lei Zhang, and Wangmeng Zuo. Mr-gdino: efficient open-world continual object detection. arXiv preprint, 2024.

  6. [6]

    Arthropod taxonomy orders object detection dataset

    Geir Drange. Arthropod taxonomy orders object detection dataset, 2019.

  7. [7]

    Open-vocabulary object detection via vision and language knowledge distillation

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint, 2021.

  8. [8]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  9. [9]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. International Conference on Learning Representations, 1(2): 3, 2022.

  10. [10]

    Underwater species detection using channel sharpening attention

    Lihao Jiang, Yi Wang, Qi Jia, Shengwei Xu, Yu Liu, Xin Fan, Haojie Li, Risheng Liu, Xinwei Xue, and Ruili Wang. Underwater species detection using channel sharpening attention. In Proceedings of the ACM International Conference on Multimedia, pages 4259–4267, 2021.

  11. [11]

    Mdetr-modulated detection for end-to-end multi-modal understanding

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021.

  12. [12]

    Dn-detr: Accelerate detr training by introducing query denoising

    Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022.

  13. [13]

    Object detection in optical remote sensing images: A survey and a new benchmark

    Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020.

  14. [14]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.

  15. [15]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.

  16. [16]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2980–2988, 2017.

  17. [17]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,

  18. [18]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.

  19. [19]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint, 2017.

  20. [20]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2641–2649, 2015.

  21. [21]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  22. [22]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666,

  23. [23]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

  24. [24]

    Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint, 2019.

  25. [25]

    Grad-cam: Visual explanations from deep networks via gradient-based localization

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF international conference on computer vision, 2017.

  26. [26]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.

  27. [27]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2556–2565, 2018.

  28. [28]

    The ubiquitous kronecker product

    Charles F Van Loan. The ubiquitous kronecker product. Journal of computational and applied mathematics, 123(1-2):85–100, 2000.

  29. [29]

    Ov-dino: Unified open-vocabulary detection with language-aware selective fusion

    Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, et al. Ov-dino: Unified open-vocabulary detection with language-aware selective fusion. arXiv preprint,

  30. [30]

    Attriclip: A non-incremental learner for incremental knowledge learning

    Runqi Wang, Xiaoyue Duan, Guoliang Kang, Jianzhuang Liu, Shaohui Lin, Songcen Xu, Jinhu Lü, and Baochang Zhang. Attriclip: A non-incremental learner for incremental knowledge learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3654–3663, 2023.

  31. [31]

    Learning to prompt for continual learning

    Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149,

  32. [32]

    Zero-shot learning-the good, the bad and the ugly

    Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4582–4591, 2017.

  33. [33]

    Boosting continual learning of vision-language models via mixture-of-experts adapters

    Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024.

  34. [34]

    Open-vocabulary detr with conditional matching

    Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In European conference on computer vision, pages 106–122. Springer, 2022.

  35. [35]

    Dino: Detr with improved denoising anchor boxes for end-to-end object detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint, 2022.

  36. [36]

    Tip-adapter: Training-free adaption of clip for few-shot classification

    Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In European conference on computer vision, pages 493–510. Springer, 2022.

  37. [37]

    Regionclip: Region-based language-image pretraining

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022.

  38. [38]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825,

  39. [39]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348,