pith. machine review for the scientific record.

arxiv: 2605.03642 · v1 · submitted 2026-05-05 · 💻 cs.CV

The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection

Pith reviewed 2026-05-07 17:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary object detection · self-supervised learning · vision-language models · pseudo-labeling · model adaptation · cooperative detection · fine-tuning

The pith

Self-supervised fine-tuning with detector pseudo-labels improves vision-language models for open-vocabulary object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Decoupled Adaptivity Training (DAT) to adapt vision-language models (VLMs) for region-level object detection in an open-vocabulary setting. A closed-set detector generates pseudo-labels on image regions, some of which contain novel objects that end up unlabeled or mislabeled. DAT then fine-tunes only the VLM's visual backbone in a decoupled way and uses weight interpolation to preserve global semantics. The approach is lightweight: it tunes fewer than 0.8 million parameters and needs no extra labels and no added inference cost. Experiments on COCO and LVIS report gains for both novel and known categories, which the authors present as a new state of the art among cooperative open-vocabulary detection methods.

Core claim

Decoupled Adaptivity Training constructs a region-aware pseudo-labeled dataset from a pre-trained closed-set detector and fine-tunes the VLM visual backbone in a decoupled manner. This enhances local feature alignment while weight interpolation preserves global knowledge, leading to improved performance on novel and known object categories in cooperative open-vocabulary detection.

What carries the argument

Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning method that uses pseudo-labels from a closed-set detector to adapt the VLM's visual backbone with weight interpolation.
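To make the weight-interpolation step concrete, here is an editorial sketch, not the authors' code. It assumes plain linear interpolation between the pre-trained and fine-tuned backbone weights, which the abstract does not confirm. The function name `interpolate_weights`, the coefficient `alpha`, and the parameter names in the toy checkpoints are all hypothetical.

```python
from typing import Dict, List

# A checkpoint is modeled as a mapping from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * pretrained + alpha * finetuned, parameter by parameter.

    alpha = 0 keeps the original VLM weights; alpha = 1 keeps the fine-tuned ones.
    Linear interpolation is an assumption made for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(w_pre, w_ft)]
    return merged


# Toy checkpoints with hypothetical parameter names.
pretrained = {"visual.proj": [1.0, 2.0], "visual.bias": [0.0]}
finetuned = {"visual.proj": [3.0, 2.0], "visual.bias": [1.0]}

merged = interpolate_weights(pretrained, finetuned, alpha=0.5)
print(merged)  # {'visual.proj': [2.0, 2.0], 'visual.bias': [0.5]}
```

Because the merge happens once, offline, the deployed model has the same architecture and size as the original VLM, which is consistent with the claim of no inference overhead.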

If this is right

  • Consistent performance gains on both novel and known categories in open-vocabulary detection.
  • No additional inference overhead or labeled data required.
  • Applicable as a plug-and-play module to existing cooperative models.
  • Fine-tunes fewer than 0.8M parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may enable iterative self-improvement by using the adapted VLM to refine future pseudo-labels.
  • It highlights a general strategy for bridging image-level pretraining of VLMs with region-level tasks using auxiliary detectors.
  • Similar adaptation could extend to other dense prediction tasks like segmentation.

Load-bearing premise

Pseudo-labels produced by the closed-set detector for novel object regions are accurate and useful enough to guide beneficial adaptation without adding harmful noise or bias.
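One way to probe this premise is to measure what the closed-set detector actually does with ground-truth novel objects. The sketch below is an editorial illustration under stated assumptions: pseudo-labels carry only base-class names, so any pseudo-box that overlaps a novel object counts as "mislabeled as base", and a novel object with no overlapping pseudo-box counts as "unlabeled". The function names, the IoU threshold of 0.5, and the toy classes are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Labeled = Tuple[Box, str]                # (box, class name)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def novel_region_breakdown(
    ground_truth: List[Labeled],
    pseudo_labels: List[Labeled],
    novel_classes: Set[str],
    iou_thr: float = 0.5,
) -> Dict[str, float]:
    """Fraction of ground-truth novel objects that are mislabeled vs. left unlabeled."""
    mislabeled = unlabeled = 0
    for gt_box, gt_cls in ground_truth:
        if gt_cls not in novel_classes:
            continue  # only novel objects are audited here
        best = max((iou(gt_box, box) for box, _ in pseudo_labels), default=0.0)
        if best >= iou_thr:
            mislabeled += 1  # covered by a pseudo-box that can only name a base class
        else:
            unlabeled += 1   # no pseudo-box covers it
    total = mislabeled + unlabeled
    return {
        "novel_objects": total,
        "mislabeled_as_base": mislabeled / total if total else 0.0,
        "unlabeled": unlabeled / total if total else 0.0,
    }


# Toy example: "cat" and "umbrella" play the role of novel classes.
ground_truth = [((0, 0, 10, 10), "cat"), ((20, 20, 30, 30), "umbrella"), ((50, 50, 60, 60), "person")]
pseudo_labels = [((1, 1, 10, 10), "dog"), ((50, 50, 60, 60), "person")]

stats = novel_region_breakdown(ground_truth, pseudo_labels, {"cat", "umbrella"})
print(stats)  # {'novel_objects': 2, 'mislabeled_as_base': 0.5, 'unlabeled': 0.5}
```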

What would settle it

A controlled test where the closed-set detector's pseudo-labels are deliberately made inaccurate on novel objects, followed by observing whether DAT still improves or instead degrades VLM performance.
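The corruption step of such a test could look like the following editorial sketch. It assumes symmetric label noise, where each pseudo-label is swapped for a different class with a fixed probability. The paper does not describe this experiment, and `corrupt_labels`, `noise_rate`, and the toy class list are hypothetical.

```python
import random
from typing import List, Sequence


def corrupt_labels(
    labels: Sequence[str],
    classes: Sequence[str],
    noise_rate: float,
    seed: int = 0,
) -> List[str]:
    """Replace each label with a different class with probability noise_rate."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    if len(set(classes)) < 2:
        raise ValueError("need at least two classes to flip a label")

    rng = random.Random(seed)  # seeded so every run sees the same corruption
    corrupted: List[str] = []
    for label in labels:
        if rng.random() < noise_rate:
            alternatives = [c for c in classes if c != label]
            corrupted.append(rng.choice(alternatives))
        else:
            corrupted.append(label)
    return corrupted


labels = ["dog", "person", "car", "dog"]
classes = ["dog", "person", "car"]

# Sweep the noise rate; each corrupted set would feed one fine-tuning run.
for rate in (0.0, 0.25, 0.5, 1.0):
    print(rate, corrupt_labels(labels, classes, rate, seed=0))
```

Plotting novel-category accuracy against `noise_rate` would show whether adaptation degrades gracefully or collapses once the labels stop being informative.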

Figures

Figures reproduced from arXiv: 2605.03642 by Changjae Oh (Queen Mary University of London), Yazhe Wan.

Figure 1: Global-local feature imbalance in VLMs under cooperative closed-set and …
Figure 2: The Cooperative Foundational Models (CFM) pipeline with the VLM …
Figure 3: Overview of the proposed Decoupled Adaptivity Training (DAT). Using …
Figure 4: Sample results of Cooperative Foundational Models (CFM) and ours.
Original abstract

Open-vocabulary object detection aims to recognize objects from an open set of categories, which leverages vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consisting of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the visual backbone of the VLM in a decoupled manner, which enhances local feature alignment while preserving global semantic knowledge via weight interpolation. DAT is a plug-and-play module that requires no inference overhead and fine-tunes fewer than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.
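The zero-shot step of the cooperative paradigm can be illustrated with a small editorial sketch. It assumes the usual recipe for contrastive VLMs: embed the cropped region and each candidate class name, then pick the class with the highest cosine similarity. The vectors below are hand-written stand-ins for real encoder outputs, and `classify_region` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_region(
    region_embedding: List[float],
    text_embeddings: Dict[str, List[float]],
) -> Tuple[str, float]:
    """Return the class whose text embedding is most similar to the region embedding."""
    scores = {name: cosine(region_embedding, emb) for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Stand-in text embeddings for an open vocabulary (real ones come from the VLM text encoder).
text_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "umbrella": [0.0, 1.0, 0.0],
    "kite": [0.0, 0.0, 1.0],
}

# Stand-in embedding of one cropped region (real ones come from the VLM visual backbone).
region_embedding = [0.1, 0.9, 0.2]

label, score = classify_region(region_embedding, text_embeddings)
print(label, round(score, 3))  # umbrella 0.971
```

The abstract's concern is that the visual backbone, trained on whole images, produces weaker embeddings for tight crops like `region_embedding`; DAT targets that side of the similarity computation.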

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Decoupled Adaptivity Training (DAT), a lightweight self-supervised adaptation method for vision-language models (VLMs) used in cooperative open-vocabulary object detection. Given a closed-set detector paired with a VLM, DAT generates region-aware pseudo-labels by running the detector over images (where novel-object regions may be unlabeled or mislabeled), then fine-tunes only the VLM visual backbone in a decoupled fashion while using weight interpolation to retain global semantics. The method is presented as plug-and-play with <0.8M tunable parameters and no inference cost. Experiments on COCO and LVIS are claimed to show consistent gains on both novel and base categories, establishing a new state of the art.
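The dataset-construction step summarized above can be sketched as follows. This is an editorial illustration only: the confidence threshold `score_thr`, the minimum box side `min_side`, and the `Detection` and `RegionSample` records are hypothetical, and the abstract does not say which filters, if any, the authors apply.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Detection:
    """One output of the closed-set detector (label is always a base class)."""
    image_id: str
    box: Box
    label: str
    score: float


@dataclass(frozen=True)
class RegionSample:
    """One training example for adapting the VLM visual backbone."""
    image_id: str
    box: Box
    pseudo_label: str


def build_pseudo_labeled_regions(
    detections: List[Detection],
    score_thr: float = 0.5,
    min_side: float = 8.0,
) -> List[RegionSample]:
    """Keep confident, reasonably sized detections as region-level pseudo-labels."""
    samples: List[RegionSample] = []
    for det in detections:
        x1, y1, x2, y2 = det.box
        if det.score < score_thr:
            continue  # drop low-confidence boxes
        if (x2 - x1) < min_side or (y2 - y1) < min_side:
            continue  # drop boxes too small to crop meaningfully
        samples.append(RegionSample(det.image_id, det.box, det.label))
    return samples


detections = [
    Detection("img1", (10, 10, 110, 90), "person", 0.92),
    Detection("img1", (200, 40, 260, 120), "dog", 0.31),   # low confidence
    Detection("img2", (5, 5, 9, 9), "car", 0.88),          # too small
    Detection("img2", (30, 30, 130, 100), "car", 0.75),
]

dataset = build_pseudo_labeled_regions(detections)
print(len(dataset))  # 2
```

No human annotation enters this loop, which is what makes the adaptation self-supervised. It is also why a novel object can only appear in `dataset` under a base-class name or not at all.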

Significance. If the reported gains are robust, DAT offers an efficient way to improve local feature alignment in image-level pre-trained VLMs for region-level detection tasks without retraining the full model or adding runtime overhead. This could strengthen cooperative detector+VLM pipelines for open-vocabulary settings by providing a simple adaptation step that leverages existing closed-set detectors.

major comments (2)
  1. [Method section (pseudo-label dataset construction)] The central claim that pseudo-labels from the closed-set detector produce a net-positive adaptation signal for novel-object regions is load-bearing, yet the manuscript provides no quantitative assessment of pseudo-label accuracy, noise rate, or bias specifically on novel categories (e.g., fraction of novel regions assigned wrong base-class labels or background). Without such analysis or a controlled ablation (clean vs. noisy labels), it is unclear whether the observed gains arise from beneficial local alignment or from other factors.
  2. [Experiments section] The experiments section claims consistent improvements and new SOTA on COCO and LVIS for both novel and known categories, but the provided abstract and summary contain no numerical metrics, baseline tables, or ablation results on the decoupled fine-tuning versus weight-interpolation components. This prevents verification that the gains are attributable to DAT rather than the underlying cooperative model or training details.
minor comments (1)
  1. [Abstract] The abstract would benefit from at least one or two key quantitative results (e.g., mAP gains on novel classes) to substantiate the 'consistent improvements' and 'new SOTA' statements.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Method section (pseudo-label dataset construction)] The central claim that pseudo-labels from the closed-set detector produce a net-positive adaptation signal for novel-object regions is load-bearing, yet the manuscript provides no quantitative assessment of pseudo-label accuracy, noise rate, or bias specifically on novel categories (e.g., fraction of novel regions assigned wrong base-class labels or background). Without such analysis or a controlled ablation (clean vs. noisy labels), it is unclear whether the observed gains arise from beneficial local alignment or from other factors.

    Authors: We agree that a direct quantitative assessment of pseudo-label quality would strengthen the central claim. While the consistent gains on novel categories across COCO and LVIS experiments provide indirect evidence that the pseudo-labels yield a net-positive signal for local alignment, the original manuscript does not report explicit noise rates or bias statistics on novel regions. In the revised manuscript we will add (i) a quantitative breakdown of pseudo-label accuracy on novel objects by comparing detector outputs against ground-truth annotations on the validation sets and (ii) a controlled ablation that trains with clean ground-truth labels versus the noisy pseudo-labels generated by the closed-set detector.
    Revision: yes

  2. Referee: [Experiments section] The experiments section claims consistent improvements and new SOTA on COCO and LVIS for both novel and known categories, but the provided abstract and summary contain no numerical metrics, baseline tables, or ablation results on the decoupled fine-tuning versus weight-interpolation components. This prevents verification that the gains are attributable to DAT rather than the underlying cooperative model or training details.

    Authors: The full manuscript (Section 4) already contains the numerical results, baseline comparison tables, and component ablations for both COCO and LVIS, including separate evaluations of the decoupled fine-tuning and weight-interpolation steps. To improve immediate verifiability we will (i) revise the abstract to report the key quantitative gains and (ii) add a concise summary table of main results plus an expanded ablation isolating the two DAT components in the introduction or early experiments section.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical adaptation without derivations or self-referential claims

Full rationale

The paper describes a plug-and-play empirical method (DAT) that generates pseudo-labels from an external pre-trained closed-set detector and uses them to fine-tune only the VLM visual backbone, followed by weight interpolation. No equations, derivations, fitted parameters presented as predictions, or mathematical claims appear in the abstract or description. The central performance claims rest on experiments on COCO and LVIS rather than reducing to inputs defined inside the work. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. This is a standard empirical contribution whose validity is externally falsifiable via the reported dataset results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that a pre-trained closed-set detector can generate usable pseudo-labels for novel objects and that standard VLM pre-training already encodes sufficient global semantics to be preserved by interpolation. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Pre-trained closed-set detectors produce region proposals whose labels, even when incomplete for novel classes, still provide useful training signal for VLM adaptation.
    Invoked when constructing the region-aware pseudo-labeled dataset from the closed-set detector output.
  • domain assumption: Weight interpolation after fine-tuning the visual backbone preserves global semantic knowledge without degrading local feature alignment.
    Stated as the mechanism that allows decoupled adaptation while keeping prior VLM capabilities.

pith-pipeline@v0.9.0 · 5531 in / 1440 out tokens · 49729 ms · 2026-05-07T17:52:57.542307+00:00 · methodology

