arxiv: 2605.03642 · v1 · submitted 2026-05-05 · 💻 cs.CV

The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection

Yazhe Wan, Changjae Oh (Queen Mary University of London)

Pith reviewed 2026-05-07 17:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary object detection · self-supervised learning · vision-language models · pseudo-labeling · model adaptation · cooperative detection · fine-tuning

The pith

Self-supervised fine-tuning with detector pseudo-labels improves vision-language models for open-vocabulary object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Decoupled Adaptivity Training (DAT) to adapt vision-language models for better region-level object detection in an open-vocabulary setting. It uses a closed-set detector to generate pseudo-labels on regions, including novel objects, then fine-tunes only the VLM's visual backbone in a decoupled way that preserves global semantics through weight interpolation. This lightweight approach, tuning under 0.8 million parameters, requires no extra labels or inference cost. Tests on COCO and LVIS datasets demonstrate gains for both new and familiar categories, setting a new standard for cooperative open-vocabulary detection methods.

Core claim

Decoupled Adaptivity Training constructs a region-aware pseudo-labeled dataset from a pre-trained closed-set detector, then fine-tunes the VLM visual backbone in a decoupled manner. The fine-tuning enhances local feature alignment, while weight interpolation preserves global knowledge. Together these steps improve performance on both novel and known object categories in cooperative open-vocabulary detection.
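
The dataset-construction step in this claim can be illustrated with a minimal sketch. It assumes the closed-set detector returns a box, a base-class label, and a confidence score for each detection. The `Detection` type, the `build_pseudo_labels` helper, and the 0.5 score threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box
    label: str    # base-class name predicted by the closed-set detector
    score: float  # detector confidence in [0, 1]


def build_pseudo_labels(
    detections: List[Detection], score_threshold: float = 0.5
) -> List[Tuple[Box, str]]:
    """Keep confident detections as (region, pseudo-label) training pairs.

    The threshold is an illustrative assumption. Regions that actually
    contain novel objects can pass this filter with a wrong base-class
    label, which is the noise the review flags as the main risk.
    """
    return [(d.box, d.label) for d in detections if d.score >= score_threshold]


detections = [
    Detection((10, 20, 110, 220), "person", 0.92),
    Detection((200, 40, 320, 160), "dog", 0.31),  # dropped: low confidence
    Detection((50, 60, 90, 120), "bicycle", 0.77),
]
pseudo_labeled = build_pseudo_labels(detections)
print(pseudo_labeled)
```

No human annotation enters this step, which is why the review describes the adaptation as self-supervised.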

What carries the argument

Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning method that uses pseudo-labels from a closed-set detector to adapt the VLM's visual backbone with weight interpolation.
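
A common form of weight interpolation is the linear blend θ = (1 − α)·θ_pre + α·θ_ft between the pre-trained and fine-tuned weights. The sketch below assumes that linear form. The `interpolate_weights` helper, the parameter name, and α = 0.5 are hypothetical, and the paper may interpolate differently (for example per layer).

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Linearly blend two checkpoints that share parameter names and shapes.

    alpha = 0 returns the pre-trained weights; alpha = 1 returns the
    fine-tuned weights. The linear form is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must share parameter names")
    blended: Weights = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"shape mismatch for {name}")
        blended[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(w_pre, w_ft)]
    return blended


pretrained = {"backbone.proj": [1.0, 2.0, 3.0]}
finetuned = {"backbone.proj": [3.0, 2.0, 1.0]}
blended = interpolate_weights(pretrained, finetuned, alpha=0.5)
print(blended)  # {'backbone.proj': [2.0, 2.0, 2.0]}
```

Because the blend is computed once, offline, the deployed model keeps the original architecture, which is consistent with the claim of no added inference cost.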

If this is right

Consistent performance gains on both novel and known categories in open-vocabulary detection.
No additional inference overhead or labeled data required.
Applicable as a plug-and-play module to existing cooperative models.
Fine-tunes fewer than 0.8M parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may enable iterative self-improvement by using the adapted VLM to refine future pseudo-labels.
It highlights a general strategy for bridging image-level pretraining of VLMs with region-level tasks using auxiliary detectors.
Similar adaptation could extend to other dense prediction tasks like segmentation.

Load-bearing premise

Pseudo-labels produced by the closed-set detector for novel object regions are accurate and useful enough to guide beneficial adaptation without adding harmful noise or bias.
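
One way to probe this premise is to score pseudo-labels against ground-truth annotations on a validation split. The sketch below is a hedged illustration, not the authors' protocol: the `pseudo_label_accuracy` helper, the greedy best-overlap matching, and the 0.5 IoU threshold are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Labeled = Tuple[Box, str]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_accuracy(
    pseudo: List[Labeled], ground_truth: List[Labeled], iou_threshold: float = 0.5
) -> float:
    """Fraction of pseudo-labeled boxes whose best-overlapping ground-truth
    box clears the IoU threshold and carries the same class label."""
    if not pseudo:
        return 0.0
    correct = 0
    for box, label in pseudo:
        best = max(ground_truth, key=lambda gt: iou(box, gt[0]), default=None)
        if best is not None and iou(box, best[0]) >= iou_threshold and best[1] == label:
            correct += 1
    return correct / len(pseudo)


# Toy data: "umbrella" stands in for a novel class that the closed-set
# detector mislabels as the base class "person".
ground_truth = [((0, 0, 10, 10), "cat"), ((20, 20, 40, 40), "umbrella")]
pseudo = [((0, 0, 10, 10), "cat"), ((20, 20, 40, 40), "person")]
accuracy = pseudo_label_accuracy(pseudo, ground_truth)
print(f"accuracy={accuracy:.2f}, noise rate={1 - accuracy:.2f}")
```

Restricting `ground_truth` to novel-category boxes would give the novel-specific noise rate that the premise depends on.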

What would settle it

A controlled test where the closed-set detector's pseudo-labels are deliberately made inaccurate on novel objects, followed by observing whether DAT still improves or instead degrades VLM performance.
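
The label corruption in such a test could be sketched as follows. This is an illustrative setup rather than an experiment from the paper: the `corrupt_labels` helper, the uniform choice of replacement class, and the fixed seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]
Labeled = Tuple[Box, str]


def corrupt_labels(
    samples: List[Labeled], classes: Sequence[str], noise_rate: float, seed: int = 0
) -> List[Labeled]:
    """Swap each label for a different class with probability noise_rate.

    Uniform replacement is an assumption; real detector errors are more
    likely to confuse visually similar classes.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    if len(set(classes)) < 2:
        raise ValueError("need at least two classes to corrupt labels")
    rng = random.Random(seed)  # seeded so each noise level is reproducible
    corrupted = []
    for box, label in samples:
        if rng.random() < noise_rate:
            label = rng.choice([c for c in classes if c != label])
        corrupted.append((box, label))
    return corrupted


classes = ["person", "dog", "bicycle"]
samples = [((0, 0, 10, 10), "person"), ((5, 5, 20, 20), "dog")]
clean = corrupt_labels(samples, classes, noise_rate=0.0)
fully_noisy = corrupt_labels(samples, classes, noise_rate=1.0)
print(clean)
print(fully_noisy)
```

Fine-tuning once per noise level and comparing novel-category accuracy against the unadapted VLM would show where adaptation stops helping. That training loop is outside the scope of this sketch.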

Figures

Figures reproduced from arXiv: 2605.03642 by Changjae Oh (Queen Mary University of London), Yazhe Wan.

**Figure 1.** Global-local feature imbalance in VLMs under cooperative closed-set and …

**Figure 2.** The Cooperative Foundational Models (CFM) pipeline with the VLM …

**Figure 3.** Overview of the proposed Decoupled Adaptivity Training (DAT). Using …

**Figure 4.** Sample results of Cooperative Foundational Models (CFM) and ours.

Original abstract

Open-vocabulary object detection aims to recognize objects from an open set of categories, which leverages vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consists of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the visual backbone of the VLM in a decoupled manner, which enhances local feature alignment while preserving global semantic knowledge via weight interpolation. DAT is a plug-and-play module that requires no inference overhead and fine-tunes less than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DAT is a lightweight plug-in that fine-tunes only the VLM backbone with region pseudo-labels from a closed detector plus weight interpolation, but the noise in those labels for novel objects is the main risk that needs checking.

The main takeaway is that this paper gives a simple self-supervised way to adapt the visual backbone of a VLM inside a cooperative open-vocabulary detector. They run a pre-trained closed-set detector to make region pseudo-labels, fine-tune just the backbone in a decoupled fashion, and then interpolate weights back toward the original to keep global semantics. The whole thing touches under 0.8M parameters and adds nothing at test time. Experiments on COCO and LVIS are said to lift both novel and base categories, setting a new mark for this style of model.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Decoupled Adaptivity Training (DAT), a lightweight self-supervised adaptation method for vision-language models (VLMs) used in cooperative open-vocabulary object detection. Given a closed-set detector paired with a VLM, DAT generates region-aware pseudo-labels by running the detector over images (where novel-object regions may be unlabeled or mislabeled), then fine-tunes only the VLM visual backbone in a decoupled fashion while using weight interpolation to retain global semantics. The method is presented as plug-and-play with <0.8M tunable parameters and no inference cost. Experiments on COCO and LVIS are claimed to show consistent gains on both novel and base categories, establishing a new state of the art.

Significance. If the reported gains are robust, DAT offers an efficient way to improve local feature alignment in image-level pre-trained VLMs for region-level detection tasks without retraining the full model or adding runtime overhead. This could strengthen cooperative detector+VLM pipelines for open-vocabulary settings by providing a simple adaptation step that leverages existing closed-set detectors.

major comments (2)

[Method section (pseudo-label dataset construction)] The central claim that pseudo-labels from the closed-set detector produce a net-positive adaptation signal for novel-object regions is load-bearing, yet the manuscript provides no quantitative assessment of pseudo-label accuracy, noise rate, or bias specifically on novel categories (e.g., fraction of novel regions assigned wrong base-class labels or background). Without such analysis or a controlled ablation (clean vs. noisy labels), it is unclear whether the observed gains arise from beneficial local alignment or from other factors.
[Experiments section] The experiments section claims consistent improvements and new SOTA on COCO and LVIS for both novel and known categories, but the provided abstract and summary contain no numerical metrics, baseline tables, or ablation results on the decoupled fine-tuning versus weight-interpolation components. This prevents verification that the gains are attributable to DAT rather than the underlying cooperative model or training details.

minor comments (1)

[Abstract] The abstract would benefit from at least one or two key quantitative results (e.g., mAP gains on novel classes) to substantiate the 'consistent improvements' and 'new SOTA' statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate the revisions we will make.

Referee: [Method section (pseudo-label dataset construction)] The central claim that pseudo-labels from the closed-set detector produce a net-positive adaptation signal for novel-object regions is load-bearing, yet the manuscript provides no quantitative assessment of pseudo-label accuracy, noise rate, or bias specifically on novel categories (e.g., fraction of novel regions assigned wrong base-class labels or background). Without such analysis or a controlled ablation (clean vs. noisy labels), it is unclear whether the observed gains arise from beneficial local alignment or from other factors.

Authors: We agree that a direct quantitative assessment of pseudo-label quality would strengthen the central claim. While the consistent gains on novel categories across COCO and LVIS experiments provide indirect evidence that the pseudo-labels yield a net-positive signal for local alignment, the original manuscript does not report explicit noise rates or bias statistics on novel regions. In the revised manuscript we will add (i) a quantitative breakdown of pseudo-label accuracy on novel objects by comparing detector outputs against ground-truth annotations on the validation sets and (ii) a controlled ablation that trains with clean ground-truth labels versus the noisy pseudo-labels generated by the closed-set detector. revision: yes
Referee: [Experiments section] The experiments section claims consistent improvements and new SOTA on COCO and LVIS for both novel and known categories, but the provided abstract and summary contain no numerical metrics, baseline tables, or ablation results on the decoupled fine-tuning versus weight-interpolation components. This prevents verification that the gains are attributable to DAT rather than the underlying cooperative model or training details.

Authors: The full manuscript (Section 4) already contains the numerical results, baseline comparison tables, and component ablations for both COCO and LVIS, including separate evaluations of the decoupled fine-tuning and weight-interpolation steps. To improve immediate verifiability we will (i) revise the abstract to report the key quantitative gains and (ii) add a concise summary table of main results plus an expanded ablation isolating the two DAT components in the introduction or early experiments section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical adaptation without derivations or self-referential claims

The paper describes a plug-and-play empirical method (DAT) that generates pseudo-labels from an external pre-trained closed-set detector and uses them to fine-tune only the VLM visual backbone, followed by weight interpolation. No equations, derivations, fitted parameters presented as predictions, or mathematical claims appear in the abstract or description. The central performance claims rest on experiments on COCO and LVIS rather than reducing to inputs defined inside the work. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. This is a standard empirical contribution whose validity is externally falsifiable via the reported dataset results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that a pre-trained closed-set detector can generate usable pseudo-labels for novel objects and that standard VLM pre-training already encodes sufficient global semantics to be preserved by interpolation. No free parameters or invented entities are introduced in the abstract.

axioms (2)

Domain assumption: Pre-trained closed-set detectors produce region proposals whose labels, even when incomplete for novel classes, still provide useful training signal for VLM adaptation.
Invoked when constructing the region-aware pseudo-labeled dataset from the closed-set detector output.
Domain assumption: Weight interpolation after fine-tuning the visual backbone preserves global semantic knowledge without degrading local feature alignment.
Stated as the mechanism that allows decoupled adaptation while keeping prior VLM capabilities.
