COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection
Pith reviewed 2026-06-29 18:04 UTC · model grok-4.3
The pith
NoIn-Det injects novel object concepts into open-vocabulary detectors by updating only a small subset of text-branch parameters while freezing the visual encoder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NoIn-Det is an efficient continual injection framework without additional parameters. It freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning.
What carries the argument
NoIn-Det, a continual injection process that selectively updates a small subset of text-branch parameters while anchoring the text space with texts from known concepts.
If this is right
- Detectors can absorb new categories over time without full retraining or parameter growth.
- Old concepts and open-vocabulary capability remain intact while new ones are added.
- The same text-branch update rule works across multiple vision-language model backbones.
- A single small parameter subset suffices for each injection round, keeping compute low.
Where Pith is reading between the lines
- Future work could test whether the same text-anchoring trick extends to continual learning in other multimodal tasks such as captioning or visual question answering.
- If stronger visual encoders emerge, the method's gains on novel concepts should increase without any change to the injection procedure.
- The approach implies that continual open-vocabulary detection may be solved primarily through text-space management rather than joint visual-text retraining.
Load-bearing premise
Pretrained visual encoders already perceive and represent many novel concepts, so the main bottleneck is only the lack of stable semantic alignment between visual representations and textual concepts.
What would settle it
Run the method on a visual encoder deliberately chosen or trained to lack representations of the target novel concepts; if performance on those concepts fails to improve, the alignment-only premise is false.
Figures
read the original abstract
Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional parameters.Novel-114 and the code will be released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the COVD task setting for continual open-vocabulary object detection, where models must sequentially incorporate groups of novel concepts while retaining prior concepts and original open-vocabulary knowledge. It presents the Novel-114 benchmark and proposes NoIn-Det, which freezes the visual encoder, preserves the text representation space using common and previously injected concepts, and injects novel concepts by updating only a small beneficial subset of text-branch parameters. The central claim is that this yields effective novel-concept learning, knowledge preservation, and consistent outperformance over existing continual-learning methods for VLMs without introducing additional parameters.
Significance. If the core assumption holds and the reported experiments confirm both acquisition of novel concepts and retention of prior performance, the approach would offer a parameter-efficient route to handling evolving category spaces in open-vocabulary detection, avoiding full retraining costs while leveraging existing visual encoders.
major comments (2)
- [Abstract and method description] The load-bearing assumption that 'pretrained visual encoders often already perceive and represent many novel concepts' (with alignment as the sole bottleneck) receives no direct pre-injection verification. The manuscript should report feature-level diagnostics (e.g., nearest-neighbor retrieval or linear-probe accuracy) on Novel-114 novel concepts using the frozen visual encoder before any text-branch updates; without this, performance gains could arise from text updates compensating for missing visual capacity rather than confirming the stated premise.
- [Experiments] The claim that NoIn-Det 'consistently outperforms existing continual learning methods for VLMs' requires explicit quantification of knowledge preservation (e.g., mAP or AP on old vs. new concept groups across injection stages). If the experimental section only reports aggregate post-injection metrics, the no-forgetting guarantee remains unanchored.
minor comments (2)
- [Method] Specify the exact criterion or selection procedure used to identify the 'small subset of text-branch parameters beneficial to novel concept learning.'
- [Benchmark] Clarify whether Novel-114 concept groups are disjoint from the original training vocabulary of the base detector and how this affects the open-vocabulary evaluation protocol.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our core assumptions and experimental evidence. We address each major comment point by point below.
read point-by-point responses
-
Referee: [Abstract and method description] The load-bearing assumption that 'pretrained visual encoders often already perceive and represent many novel concepts' (with alignment as the sole bottleneck) receives no direct pre-injection verification. The manuscript should report feature-level diagnostics (e.g., nearest-neighbor retrieval or linear-probe accuracy) on Novel-114 novel concepts using the frozen visual encoder before any text-branch updates; without this, performance gains could arise from text updates compensating for missing visual capacity rather than confirming the stated premise.
Authors: We agree that the manuscript would be strengthened by direct pre-injection verification of the visual encoder's capacity on novel concepts. While the design of NoIn-Det (freezing the visual encoder and updating only text parameters) and the observed performance gains are consistent with the stated premise, we did not include explicit feature-level diagnostics such as nearest-neighbor retrieval or linear-probe accuracy on the frozen encoder prior to injection. In the revised manuscript we will add these analyses on the Novel-114 novel concepts to provide direct support for the assumption. revision: yes
-
Referee: [Experiments] The claim that NoIn-Det 'consistently outperforms existing continual learning methods for VLMs' requires explicit quantification of knowledge preservation (e.g., mAP or AP on old vs. new concept groups across injection stages). If the experimental section only reports aggregate post-injection metrics, the no-forgetting guarantee remains unanchored.
Authors: We acknowledge that explicit per-stage breakdowns would more clearly anchor the knowledge-preservation claim. Our experiments already track performance on previously seen concepts to demonstrate retention, yet the presentation relies primarily on aggregate metrics. In the revision we will add detailed tables reporting mAP and AP separately for old versus new concept groups at each injection stage, alongside the existing aggregate results. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces COVD as a new task setting and proposes NoIn-Det as an empirical continual injection framework. Its central claims rest on an explicit observation about pretrained visual encoders and the design choice to freeze the visual encoder while updating only text-branch parameters. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear that reduce the claimed performance gains to quantities defined by the method itself. The approach is self-contained as a standard empirical contribution with independent experimental validation on Novel-114.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Zero-shot generalizable incremental learning for vision-language object detection.Advances in Neural Information Processing Systems, 37:136679–136700, 2024
Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, and Yunkuan Wang. Zero-shot generalizable incremental learning for vision-language object detection.Advances in Neural Information Processing Systems, 37:136679–136700, 2024. 2, 3, 7, 13
2024
-
[2]
Elevater: A benchmark and toolkit for evaluating language- augmented visual models.Advances in Neural Information Processing Systems, 35:9287–9301, 2022
Chunyuan Li, Haotian Liu, Liunian Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, et al. Elevater: A benchmark and toolkit for evaluating language- augmented visual models.Advances in Neural Information Processing Systems, 35:9287–9301, 2022. 2, 3
2022
-
[3]
Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models
Longxiang Tang, Zhuotao Tian, Kai Li, Chunming He, Hantao Zhou, Hengshuang Zhao, Xiu Li, and Jiaya Jia. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models. InEuropean conference on computer vision, pages 346–365. Springer, 2024. 2, 13
2024
-
[4]
Incremental mixture of experts: Continual learning for object detection in forestry scenarios.Pattern Recognition, page 113001, 2026
Ximeng Cheng, Qiaonan Zhu, Shukun Jia, Yichao Cao, and Xiaobo Lu. Incremental mixture of experts: Continual learning for object detection in forestry scenarios.Pattern Recognition, page 113001, 2026. 2, 7, 13
2026
-
[5]
Lada: Scalable label-specific clip adapter for continual learning,
Mao-Lin Luo, Zi-Hao Zhou, Tong Wei, and Min-Ling Zhang. Lada: Scalable label-specific clip adapter for continual learning.arXiv preprint arXiv:2505.23271, 2025. 2, 7, 13
-
[6]
Boosting continual learning of vision-language models via mixture-of-experts adapters
Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024. 2, 7, 13
2024
-
[7]
Keeplora: Continual learning with residual gradient adaptation.arXiv preprint arXiv:2601.19659, 2026
Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei, and Min-Ling Zhang. Keeplora: Continual learning with residual gradient adaptation.arXiv preprint arXiv:2601.19659, 2026. 2, 7, 13
-
[8]
C-clip: Multimodal continual learning for vision- language model
Wenzhuo Liu, Fei Zhu, Longhui Wei, and Qi Tian. C-clip: Multimodal continual learning for vision- language model. InThe Thirteenth International Conference on Learning Representations, 2025. 2, 7
2025
-
[9]
Lvis: A dataset for large vocabulary instance segmentation
Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364,
-
[10]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763, 2021. 3, 13
2021
-
[11]
Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models
Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14987–14997,
-
[12]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 248–255, 2009. 6, 16
2009
-
[13]
Objects365: A large-scale, high-quality dataset for object detection
Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019. 6, 16
2019
-
[14]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755, 2014. 7
2014
-
[15]
Open-vocabulary object detection using captions
Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021. 13
2021
-
[16]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916, 2021. 13 10
2021
-
[17]
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation.arXiv preprint arXiv:2104.13921, 2021. 13
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
Learning to prompt for open- vocabulary object detection with vision-language model
Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open- vocabulary object detection with vision-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022. 13
2022
-
[19]
Object-aware distillation pyramid for open-vocabulary object detection
Luting Wang, Yi Liu, Penghui Du, Zihan Ding, Yue Liao, Qiaosong Qi, Biaolong Chen, and Si Liu. Object-aware distillation pyramid for open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11186–11196, 2023. 13
2023
-
[20]
Distilling detr with visual-linguistic knowledge for open-vocabulary object detection
Liangqi Li, Jiaxu Miao, Dahu Shi, Wenming Tan, Ye Ren, Yi Yang, and Shiliang Pu. Distilling detr with visual-linguistic knowledge for open-vocabulary object detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6501–6510, 2023. 13
2023
-
[21]
Detecting twenty- thousand classes using image-level supervision
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty- thousand classes using image-level supervision. InEuropean Conference on Computer Vision, pages 350–368, 2022. 13
2022
-
[22]
Bridging the gap between object and image-level representations for open-vocabulary detection
Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shah- baz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Information Processing Systems, 35:33781–33794, 2022. 13
2022
-
[23]
Proxydet: Synthesizing proxy novel classes via classwise mixup for open-vocabulary object detection
Joonhyun Jeong, Geondo Park, Jayeon Yoo, Hyungsik Jung, and Heesu Kim. Proxydet: Synthesizing proxy novel classes via classwise mixup for open-vocabulary object detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 2462–2470, 2024. 13
2024
-
[24]
Learning background prompts to discover implicit knowledge for open vocabulary object detection
Jiaming Li, Jiacheng Zhang, Jichang Li, Ge Li, Si Liu, Liang Lin, and Guanbin Li. Learning background prompts to discover implicit knowledge for open vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16678–16687, 2024. 13
2024
-
[25]
Taming self-training for open-vocabulary object detection
Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Yumin Suh, Manmohan Chandraker, Dimitris N Metaxas, et al. Taming self-training for open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13938–13947, 2024. 13
2024
-
[26]
Ov-dquo: Open- vocabulary detr with denoising text query training and open-world unknown objects supervision
Junjie Wang, Bin Chen, Bin Kang, Yulin Li, Weizhi Xian, Yichi Chen, and Yong Xu. Ov-dquo: Open- vocabulary detr with denoising text query training and open-world unknown objects supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7762–7770, 2025. 13
2025
-
[27]
Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-vlm: Open-vocabulary object detection upon frozen vision and language models.arXiv preprint arXiv:2209.15639, 2022. 13
-
[28]
arXiv preprint arXiv:2310.01403 (2023)
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction.arXiv preprint arXiv:2310.01403,
-
[29]
Dst-det: Open- vocabulary object detection via dynamic self-training.IEEE Transactions on Circuits and Systems for Video Technology, 2024
Shilin Xu, Xiangtai Li, Size Wu, Wenwei Zhang, Yunhai Tong, and Chen Change Loy. Dst-det: Open- vocabulary object detection via dynamic self-training.IEEE Transactions on Circuits and Systems for Video Technology, 2024. 13
2024
-
[30]
Declip: Decoupled learning for open-vocabulary dense perception
Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open-vocabulary dense perception. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14824–14834, 2025. 13
2025
-
[31]
Region-aware pretraining for open-vocabulary object detection with vision transformers
Dahun Kim, Anelia Angelova, and Weicheng Kuo. Region-aware pretraining for open-vocabulary object detection with vision transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11144–11154, 2023. 13
2023
-
[32]
Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching
Xiaoshi Wu, Feng Zhu, Rui Zhao, and Hongsheng Li. Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7031–7040, 2023. 13
2023
-
[33]
Yolo-world: Real- time open-vocabulary object detection
Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real- time open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 13
2024
-
[34]
Yoloe: Real-time seeing anything.arXiv preprint arXiv:2503.07465, 2025
Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything.arXiv preprint arXiv:2503.07465, 2025. 13 11
-
[35]
Grounded language-image pre-training
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975,
-
[36]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55. Springer, 2024. 13
2024
-
[37]
Preventing zero-shot transfer degradation in continual learning of vision-language models
Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 19125–19136, 2023. 13 12 Appendix A Related Works Open-vocabulary object detection (OVD)[ 15] aims to detect...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.