pith. sign in

arxiv: 2605.27116 · v1 · pith:YRJ673ZJnew · submitted 2026-05-26 · 💻 cs.CV

COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection

Pith reviewed 2026-06-29 18:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords continual learningopen-vocabulary object detectionvision-language modelsconcept injectioncatastrophic forgettingnovel concept learning
0
0 comments X

The pith

NoIn-Det injects novel object concepts into open-vocabulary detectors by updating only a small subset of text-branch parameters while freezing the visual encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task called COVD where detectors must sequentially incorporate groups of emerging object categories without losing prior knowledge or the ability to handle unseen classes. It starts from the observation that pretrained visual encoders already encode many new concepts, shifting the core problem to creating stable alignment between those visual features and text labels. NoIn-Det solves this by keeping the visual encoder fixed, using texts from known concepts to hold the text space steady, and adjusting only a limited portion of the text branch for each new group. This design adds no parameters and avoids the usual trade-off between new learning and old retention. Tests on the introduced Novel-114 benchmark show the approach surpasses standard continual-learning techniques applied to vision-language detectors.

Core claim

NoIn-Det is an efficient continual injection framework without additional parameters. It freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning.

What carries the argument

NoIn-Det, a continual injection process that selectively updates a small subset of text-branch parameters while anchoring the text space with texts from known concepts.

If this is right

  • Detectors can absorb new categories over time without full retraining or parameter growth.
  • Old concepts and open-vocabulary capability remain intact while new ones are added.
  • The same text-branch update rule works across multiple vision-language model backbones.
  • A single small parameter subset suffices for each injection round, keeping compute low.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether the same text-anchoring trick extends to continual learning in other multimodal tasks such as captioning or visual question answering.
  • If stronger visual encoders emerge, the method's gains on novel concepts should increase without any change to the injection procedure.
  • The approach implies that continual open-vocabulary detection may be solved primarily through text-space management rather than joint visual-text retraining.

Load-bearing premise

Pretrained visual encoders already perceive and represent many novel concepts, so the main bottleneck is only the lack of stable semantic alignment between visual representations and textual concepts.

What would settle it

Run the method on a visual encoder deliberately chosen or trained to lack representations of the target novel concepts; if performance on those concepts fails to improve, the alignment-only premise is false.

Figures

Figures reproduced from arXiv: 2605.27116 by Liang Wan, Ruize Han, Yuntong Tian, Yupeng Zhang, Yuzhong Feng, Zixin Ren.

Figure 1
Figure 1. Figure 1: Visualization of features from the pre￾trained visual backbone. Backbone outputs are pro￾jected to RGB space using PCA. Although the ob￾jects are novel concepts unseen during pretraining, the backbone highlights object regions and produces coherent responses within semantic areas. In recent years, open-vocabulary object detec￾tion (OVD) has extended object detection be￾yond fixed category sets by leveragin… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the multi-modal represen [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of NoIn-Det. Built on LLMDet, NoIn-Det freezes the visual encoder and [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sunburst visualization of the task-wise composition and data distribution of Novel-114. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of Novel-114. The benchmark contains diverse novel, rare, and specialized [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison of different methods on Novel-114. Our method produces more [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional parameters.Novel-114 and the code will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the COVD task setting for continual open-vocabulary object detection, where models must sequentially incorporate groups of novel concepts while retaining prior concepts and original open-vocabulary knowledge. It presents the Novel-114 benchmark and proposes NoIn-Det, which freezes the visual encoder, preserves the text representation space using common and previously injected concepts, and injects novel concepts by updating only a small beneficial subset of text-branch parameters. The central claim is that this yields effective novel-concept learning, knowledge preservation, and consistent outperformance over existing continual-learning methods for VLMs without introducing additional parameters.

Significance. If the core assumption holds and the reported experiments confirm both acquisition of novel concepts and retention of prior performance, the approach would offer a parameter-efficient route to handling evolving category spaces in open-vocabulary detection, avoiding full retraining costs while leveraging existing visual encoders.

major comments (2)
  1. [Abstract and method description] The load-bearing assumption that 'pretrained visual encoders often already perceive and represent many novel concepts' (with alignment as the sole bottleneck) receives no direct pre-injection verification. The manuscript should report feature-level diagnostics (e.g., nearest-neighbor retrieval or linear-probe accuracy) on Novel-114 novel concepts using the frozen visual encoder before any text-branch updates; without this, performance gains could arise from text updates compensating for missing visual capacity rather than confirming the stated premise.
  2. [Experiments] The claim that NoIn-Det 'consistently outperforms existing continual learning methods for VLMs' requires explicit quantification of knowledge preservation (e.g., mAP or AP on old vs. new concept groups across injection stages). If the experimental section only reports aggregate post-injection metrics, the no-forgetting guarantee remains unanchored.
minor comments (2)
  1. [Method] Specify the exact criterion or selection procedure used to identify the 'small subset of text-branch parameters beneficial to novel concept learning.'
  2. [Benchmark] Clarify whether Novel-114 concept groups are disjoint from the original training vocabulary of the base detector and how this affects the open-vocabulary evaluation protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our core assumptions and experimental evidence. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract and method description] The load-bearing assumption that 'pretrained visual encoders often already perceive and represent many novel concepts' (with alignment as the sole bottleneck) receives no direct pre-injection verification. The manuscript should report feature-level diagnostics (e.g., nearest-neighbor retrieval or linear-probe accuracy) on Novel-114 novel concepts using the frozen visual encoder before any text-branch updates; without this, performance gains could arise from text updates compensating for missing visual capacity rather than confirming the stated premise.

    Authors: We agree that the manuscript would be strengthened by direct pre-injection verification of the visual encoder's capacity on novel concepts. While the design of NoIn-Det (freezing the visual encoder and updating only text parameters) and the observed performance gains are consistent with the stated premise, we did not include explicit feature-level diagnostics such as nearest-neighbor retrieval or linear-probe accuracy on the frozen encoder prior to injection. In the revised manuscript we will add these analyses on the Novel-114 novel concepts to provide direct support for the assumption. revision: yes

  2. Referee: [Experiments] The claim that NoIn-Det 'consistently outperforms existing continual learning methods for VLMs' requires explicit quantification of knowledge preservation (e.g., mAP or AP on old vs. new concept groups across injection stages). If the experimental section only reports aggregate post-injection metrics, the no-forgetting guarantee remains unanchored.

    Authors: We acknowledge that explicit per-stage breakdowns would more clearly anchor the knowledge-preservation claim. Our experiments already track performance on previously seen concepts to demonstrate retention, yet the presentation relies primarily on aggregate metrics. In the revision we will add detailed tables reporting mAP and AP separately for old versus new concept groups at each injection stage, alongside the existing aggregate results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces COVD as a new task setting and proposes NoIn-Det as an empirical continual injection framework. Its central claims rest on an explicit observation about pretrained visual encoders and the design choice to freeze the visual encoder while updating only text-branch parameters. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear that reduce the claimed performance gains to quantities defined by the method itself. The approach is self-contained as a standard empirical contribution with independent experimental validation on Novel-114.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of specific free parameters, axioms, or invented entities; no numerical fitting or new postulated objects are mentioned.

pith-pipeline@v0.9.1-grok · 5771 in / 1106 out tokens · 35882 ms · 2026-06-29T18:04:38.568265+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Zero-shot generalizable incremental learning for vision-language object detection.Advances in Neural Information Processing Systems, 37:136679–136700, 2024

    Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, and Yunkuan Wang. Zero-shot generalizable incremental learning for vision-language object detection.Advances in Neural Information Processing Systems, 37:136679–136700, 2024. 2, 3, 7, 13

  2. [2]

    Elevater: A benchmark and toolkit for evaluating language- augmented visual models.Advances in Neural Information Processing Systems, 35:9287–9301, 2022

    Chunyuan Li, Haotian Liu, Liunian Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, et al. Elevater: A benchmark and toolkit for evaluating language- augmented visual models.Advances in Neural Information Processing Systems, 35:9287–9301, 2022. 2, 3

  3. [3]

    Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models

    Longxiang Tang, Zhuotao Tian, Kai Li, Chunming He, Hantao Zhou, Hengshuang Zhao, Xiu Li, and Jiaya Jia. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models. InEuropean conference on computer vision, pages 346–365. Springer, 2024. 2, 13

  4. [4]

    Incremental mixture of experts: Continual learning for object detection in forestry scenarios.Pattern Recognition, page 113001, 2026

    Ximeng Cheng, Qiaonan Zhu, Shukun Jia, Yichao Cao, and Xiaobo Lu. Incremental mixture of experts: Continual learning for object detection in forestry scenarios.Pattern Recognition, page 113001, 2026. 2, 7, 13

  5. [5]

    Lada: Scalable label-specific clip adapter for continual learning,

    Mao-Lin Luo, Zi-Hao Zhou, Tong Wei, and Min-Ling Zhang. Lada: Scalable label-specific clip adapter for continual learning.arXiv preprint arXiv:2505.23271, 2025. 2, 7, 13

  6. [6]

    Boosting continual learning of vision-language models via mixture-of-experts adapters

    Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024. 2, 7, 13

  7. [7]

    Keeplora: Continual learning with residual gradient adaptation.arXiv preprint arXiv:2601.19659, 2026

    Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei, and Min-Ling Zhang. Keeplora: Continual learning with residual gradient adaptation.arXiv preprint arXiv:2601.19659, 2026. 2, 7, 13

  8. [8]

    C-clip: Multimodal continual learning for vision- language model

    Wenzhuo Liu, Fei Zhu, Longhui Wei, and Qi Tian. C-clip: Multimodal continual learning for vision- language model. InThe Thirteenth International Conference on Learning Representations, 2025. 2, 7

  9. [9]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364,

  10. [10]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763, 2021. 3, 13

  11. [11]

    Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models

    Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14987–14997,

  12. [12]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 248–255, 2009. 6, 16

  13. [13]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019. 6, 16

  14. [14]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755, 2014. 7

  15. [15]

    Open-vocabulary object detection using captions

    Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021. 13

  16. [16]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916, 2021. 13 10

  17. [17]

    Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation.arXiv preprint arXiv:2104.13921, 2021. 13

  18. [18]

    Learning to prompt for open- vocabulary object detection with vision-language model

    Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open- vocabulary object detection with vision-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022. 13

  19. [19]

    Object-aware distillation pyramid for open-vocabulary object detection

    Luting Wang, Yi Liu, Penghui Du, Zihan Ding, Yue Liao, Qiaosong Qi, Biaolong Chen, and Si Liu. Object-aware distillation pyramid for open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11186–11196, 2023. 13

  20. [20]

    Distilling detr with visual-linguistic knowledge for open-vocabulary object detection

    Liangqi Li, Jiaxu Miao, Dahu Shi, Wenming Tan, Ye Ren, Yi Yang, and Shiliang Pu. Distilling detr with visual-linguistic knowledge for open-vocabulary object detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6501–6510, 2023. 13

  21. [21]

    Detecting twenty- thousand classes using image-level supervision

    Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty- thousand classes using image-level supervision. InEuropean Conference on Computer Vision, pages 350–368, 2022. 13

  22. [22]

    Bridging the gap between object and image-level representations for open-vocabulary detection

    Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shah- baz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Information Processing Systems, 35:33781–33794, 2022. 13

  23. [23]

    Proxydet: Synthesizing proxy novel classes via classwise mixup for open-vocabulary object detection

    Joonhyun Jeong, Geondo Park, Jayeon Yoo, Hyungsik Jung, and Heesu Kim. Proxydet: Synthesizing proxy novel classes via classwise mixup for open-vocabulary object detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 2462–2470, 2024. 13

  24. [24]

    Learning background prompts to discover implicit knowledge for open vocabulary object detection

    Jiaming Li, Jiacheng Zhang, Jichang Li, Ge Li, Si Liu, Liang Lin, and Guanbin Li. Learning background prompts to discover implicit knowledge for open vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16678–16687, 2024. 13

  25. [25]

    Taming self-training for open-vocabulary object detection

    Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Yumin Suh, Manmohan Chandraker, Dimitris N Metaxas, et al. Taming self-training for open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13938–13947, 2024. 13

  26. [26]

    Ov-dquo: Open- vocabulary detr with denoising text query training and open-world unknown objects supervision

    Junjie Wang, Bin Chen, Bin Kang, Yulin Li, Weizhi Xian, Yichi Chen, and Yong Xu. Ov-dquo: Open- vocabulary detr with denoising text query training and open-world unknown objects supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7762–7770, 2025. 13

  27. [27]

    F-vlm: Open-vocabulary object detection upon frozen vision and language models.arXiv preprint arXiv:2209.15639, 2022

    Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-vlm: Open-vocabulary object detection upon frozen vision and language models.arXiv preprint arXiv:2209.15639, 2022. 13

  28. [28]

    arXiv preprint arXiv:2310.01403 (2023)

    Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction.arXiv preprint arXiv:2310.01403,

  29. [29]

    Dst-det: Open- vocabulary object detection via dynamic self-training.IEEE Transactions on Circuits and Systems for Video Technology, 2024

    Shilin Xu, Xiangtai Li, Size Wu, Wenwei Zhang, Yunhai Tong, and Chen Change Loy. Dst-det: Open- vocabulary object detection via dynamic self-training.IEEE Transactions on Circuits and Systems for Video Technology, 2024. 13

  30. [30]

    Declip: Decoupled learning for open-vocabulary dense perception

    Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open-vocabulary dense perception. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14824–14834, 2025. 13

  31. [31]

    Region-aware pretraining for open-vocabulary object detection with vision transformers

    Dahun Kim, Anelia Angelova, and Weicheng Kuo. Region-aware pretraining for open-vocabulary object detection with vision transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11144–11154, 2023. 13

  32. [32]

    Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching

    Xiaoshi Wu, Feng Zhu, Rui Zhao, and Hongsheng Li. Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7031–7040, 2023. 13

  33. [33]

    Yolo-world: Real- time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real- time open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 13

  34. [34]

    Yoloe: Real-time seeing anything.arXiv preprint arXiv:2503.07465, 2025

    Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything.arXiv preprint arXiv:2503.07465, 2025. 13 11

  35. [35]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975,

  36. [36]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55. Springer, 2024. 13

  37. [37]

    Preventing zero-shot transfer degradation in continual learning of vision-language models

    Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 19125–19136, 2023. 13 12 Appendix A Related Works Open-vocabulary object detection (OVD)[ 15] aims to detect...