Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Pith reviewed 2026-05-17 13:26 UTC · model grok-4.3
The pith
A vision-language distillation method trains object detectors to recognize arbitrary text-described objects, including categories never seen in training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViLD distills knowledge from a pretrained open-vocabulary image classification model (the teacher) into a two-stage detector (the student). The teacher encodes category texts and the image regions of object proposals, and the student is trained so that its region embeddings align with those teacher embeddings. On LVIS this yields 16.1 mask AP_r with a ResNet-50 backbone (3.8 above the supervised baseline) and 26.3 AP_r with a stronger ALIGN teacher. Direct transfer without finetuning reaches 72.2 AP50 on PASCAL VOC, and on COCO ViLD outperforms prior work by 4.8 novel AP.
What carries the argument
The ViLD alignment loss, which matches student region embeddings to the teacher's text embeddings of category names and to the teacher's image embeddings of the same object proposals.
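A minimal sketch of how such an alignment loss could be assembled, in plain Python. This page does not spell out the loss form, so the sketch assumes a softmax cross-entropy over temperature-scaled cosine similarities for the text term and an L1 distance between normalized embeddings for the image term. The function names, `temperature`, `image_weight`, and their default values are illustrative assumptions, and details such as background proposals are omitted.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def vild_text_loss(region_emb, text_embs, label, temperature=0.01):
    """Cross-entropy over temperature-scaled cosine similarities to text embeddings."""
    logits = [cosine(region_emb, t) / temperature for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def vild_image_loss(region_emb, teacher_image_emb):
    """L1 distance between normalized student and teacher image embeddings."""
    s = l2_normalize(region_emb)
    t = l2_normalize(teacher_image_emb)
    return sum(abs(x - y) for x, y in zip(s, t))

def vild_loss(region_emb, text_embs, label, teacher_image_emb,
              temperature=0.01, image_weight=0.5):
    """Weighted sum of the text and image alignment terms (weights are assumed)."""
    text_term = vild_text_loss(region_emb, text_embs, label, temperature)
    image_term = vild_image_loss(region_emb, teacher_image_emb)
    return text_term + image_weight * image_term
```

In this sketch the text term needs a category label, so it can only be computed for annotated base-category boxes, while the image term needs no label and can be computed on any proposal the teacher has encoded.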
If this is right
- Detectors gain the ability to detect novel classes described only by text without additional labeled training data.
- Performance on rare categories can exceed that of models trained with full supervision on those categories.
- A single model transfers directly to other detection datasets such as COCO and PASCAL VOC without finetuning.
- Stronger vision-language teachers produce correspondingly stronger open-vocabulary detectors.
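The first and third consequences above rest on the idea that the category list is just a set of text embeddings that can be swapped at inference time. A hedged illustration follows: the three-dimensional vectors stand in for real text-encoder outputs, and `classify_region` is a hypothetical helper, not part of the released code.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def classify_region(region_emb, vocabulary):
    """Return (best category name, score) by cosine similarity.

    `vocabulary` maps a category name to its text embedding.
    """
    scores = {name: cosine(region_emb, emb) for name, emb in vocabulary.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy embeddings (assumed values), standing in for text-encoder outputs.
base_vocab = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
# Adding a category at inference time requires only its text embedding.
extended_vocab = dict(base_vocab, zebra=[0.0, 0.0, 1.0])

region = [0.1, 0.2, 0.9]
print(classify_region(region, base_vocab)[0])
print(classify_region(region, extended_vocab)[0])
```

With the base vocabulary the region is forced onto the nearest known name; once the new name is added, the same region embedding is matched to it without any retraining of the detector.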
Where Pith is reading between the lines
- The same distillation pattern could be applied to instance segmentation or other region-based tasks where text supervision is scarce.
- Large-scale vision-language models may serve as reusable teachers that reduce the annotation burden for many specialized vision problems.
- If the alignment holds across domains, detectors could be adapted to new visual styles or vocabularies by updating only the teacher rather than recollecting detection data.
Load-bearing premise
The teacher's embeddings of image regions and category texts remain sufficiently aligned with the student's region embeddings, even for categories never seen during detector training.
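One way to probe this premise is to compare the average student-teacher cosine similarity on base-category regions with the same average on novel-category regions. The sketch below assumes the embeddings have already been extracted and paired; `mean_alignment` and `alignment_gap` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def mean_alignment(student_embs, teacher_embs):
    """Average cosine similarity over paired student/teacher embeddings."""
    if len(student_embs) != len(teacher_embs):
        raise ValueError("embedding lists must be paired one-to-one")
    if not student_embs:
        raise ValueError("need at least one pair")
    sims = [cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(sims) / len(sims)

def alignment_gap(base_pairs, novel_pairs):
    """Base-minus-novel mean alignment.

    Each argument is a (student_embs, teacher_embs) tuple. A large positive
    gap would suggest the premise is weaker on novel categories.
    """
    return mean_alignment(*base_pairs) - mean_alignment(*novel_pairs)
```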
What would settle it
Train ViLD on LVIS with the rare categories held out and measure mask AP_r on those categories. A score below the supervised baseline would falsify the claim that the distillation enables effective open-vocabulary detection.
Original abstract
We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ViLD, a knowledge distillation approach for open-vocabulary object detection. A pretrained vision-language teacher model (e.g., CLIP or ALIGN) encodes both category texts and image regions from object proposals; these embeddings supervise a two-stage student detector whose region features are aligned to the teacher's text and image embeddings via distillation losses. On LVIS, rare categories are held out as novel classes unseen during training. With a ResNet-50 backbone the method reports 16.1 mask AP_r on rare categories (3.8 above the supervised baseline); with an ALIGN teacher it reaches 26.3 AP_r. Direct transfer without finetuning yields 72.2 AP50 on PASCAL VOC, 36.6 AP on COCO, and 11.8 AP on Objects365. Separately, on COCO the method outperforms the previous state of the art by 4.8 novel AP.
Significance. If the reported gains are reproducible, the work would be significant for demonstrating that large-scale vision-language pretraining can be transferred to detection without exhaustive per-category labels. The open-sourcing of code and demo is a clear strength that supports reproducibility and follow-up research. The empirical results on held-out rare classes and cross-dataset transfer provide concrete evidence that the distillation route can exceed fully supervised baselines on novel categories.
major comments (3)
- [§4 and Table 1] §4 (LVIS experiments) and Table 1: the central claim that ViLD generalizes to novel categories rests on the unverified assumption that student region embeddings (produced by an RPN and box head trained exclusively on base categories) remain aligned with the teacher's embeddings for rare classes. No direct measurement of proposal recall, embedding cosine similarity, or ranking quality on held-out rare categories is reported; without such diagnostics the 16.1 AP_r and 3.8 gain over the supervised baseline cannot be confidently attributed to open-vocabulary generalization rather than base-class proposal statistics.
- [§3.2] §3.2 (distillation losses): the alignment between student region features and teacher text/image embeddings is controlled by loss weights and temperature; these hyperparameters are listed among the free parameters yet no ablation shows their effect on novel-category AP_r versus base-category performance, leaving open whether the reported gains are robust or sensitive to choices tuned on base data only.
- [§4] Experimental details: the manuscript provides concrete AP numbers but omits error bars, number of runs, and full hyperparameter tables. Given that the 3.8 AP_r margin is modest, statistical significance and variance estimates are needed to establish that the improvement is reliable rather than within run-to-run fluctuation.
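The proposal-recall diagnostic requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the IoU threshold of 0.5 is a conventional choice rather than one taken from the paper, and `proposal_recall` is a hypothetical helper.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def proposal_recall(gt_boxes, proposals, iou_threshold=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for gt in gt_boxes
        if any(iou(gt, p) >= iou_threshold for p in proposals)
    )
    return hits / len(gt_boxes)
```

Running this separately on ground-truth boxes of rare and of base categories would show whether a proposal network trained only on base categories still covers the held-out objects.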
minor comments (2)
- [Abstract] Abstract: the term 'mask AP_r' should be defined on first use (rare-category mask average precision) to aid readers unfamiliar with the LVIS protocol.
- [§4.3] The transfer experiments on PASCAL VOC, COCO, and Objects365 are presented without finetuning; a brief statement clarifying how category names are mapped to the target datasets would improve clarity.
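For the second minor comment, one plausible way to map a target dataset's category names onto text inputs is to fill each name into a prompt template before encoding it. The sketch below assumes a template of the form `a photo of {article} {category}` and a simple spelling heuristic for the article; `build_prompt` and `build_prompts` are hypothetical helpers, and the text-encoding step itself is not shown.

```python
VOWELS = {"a", "e", "i", "o", "u"}

def build_prompt(category, template="a photo of {article} {category}"):
    """Fill a category name into a prompt template.

    Underscores are treated as spaces, and the article is chosen from the
    first letter only (a spelling heuristic, not a pronunciation rule).
    """
    name = category.replace("_", " ").strip()
    article = "an" if name[:1].lower() in VOWELS else "a"
    return template.format(article=article, category=name)

def build_prompts(categories):
    """Map each category name to its prompt string."""
    return {c: build_prompt(c) for c in categories}

print(build_prompts(["aeroplane", "traffic_light"]))
```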
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and outline revisions to improve the manuscript where feasible.
Point-by-point responses
-
Referee: [§4 and Table 1] §4 (LVIS experiments) and Table 1: the central claim that ViLD generalizes to novel categories rests on the unverified assumption that student region embeddings (produced by an RPN and box head trained exclusively on base categories) remain aligned with the teacher's embeddings for rare classes. No direct measurement of proposal recall, embedding cosine similarity, or ranking quality on held-out rare categories is reported; without such diagnostics the 16.1 AP_r and 3.8 gain over the supervised baseline cannot be confidently attributed to open-vocabulary generalization rather than base-class proposal statistics.
Authors: The supervised baseline uses identical RPN and box-head training on base categories only, so proposal statistics are matched in the comparison. The observed 3.8 AP_r gain on rare classes therefore arises from the student's ability to leverage the teacher's open-vocabulary text embeddings for classification, which the supervised model lacks. To strengthen the evidence, the revised manuscript will add proposal-recall statistics for rare ground-truth boxes and average cosine similarity between student region embeddings and the corresponding teacher embeddings on rare-class proposals. revision: yes
-
Referee: [§3.2] §3.2 (distillation losses): the alignment between student region features and teacher text/image embeddings is controlled by loss weights and temperature; these hyperparameters are listed among the free parameters yet no ablation shows their effect on novel-category AP_r versus base-category performance, leaving open whether the reported gains are robust or sensitive to choices tuned on base data only.
Authors: We agree that an explicit ablation on loss weights and temperature is valuable for demonstrating robustness. The revised version will include a new table that varies these hyperparameters and reports the resulting base and novel AP_r, to show whether the reported gains remain stable across reasonable settings. revision: yes
-
Referee: [§4] Experimental details: the manuscript provides concrete AP numbers but omits error bars, number of runs, and full hyperparameter tables. Given that the 3.8 AP_r margin is modest, statistical significance and variance estimates are needed to establish that the improvement is reliable rather than within run-to-run fluctuation.
Authors: We will add a complete hyperparameter table to the appendix. Because of the substantial compute required to train on LVIS, all reported numbers come from single runs; we will explicitly note this limitation and point to the consistent improvement observed in both the ResNet-50 backbone setting and the stronger ALIGN-teacher setting. revision: partial
- Unresolved: error bars and statistical significance from multiple independent runs, because the experiments were run only once owing to computational cost.
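If multiple runs were available, the variance question in the third point could be checked roughly as follows. The sketch compares the mean gap between method and baseline against a multiple of the combined standard error; `margin_exceeds_noise` and the factor `k=2.0` are assumptions for illustration, and the scores in the usage lines are placeholder values, not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)

def margin_exceeds_noise(method_scores, baseline_scores, k=2.0):
    """True if the mean gap is larger than k times the combined standard error."""
    m_mean, m_std = summarize_runs(method_scores)
    b_mean, b_std = summarize_runs(baseline_scores)
    se = (m_std ** 2 / len(method_scores)
          + b_std ** 2 / len(baseline_scores)) ** 0.5
    return (m_mean - b_mean) > k * se

# Placeholder scores for illustration only (not measured values).
print(margin_exceeds_noise([10.0, 11.0, 12.0], [5.0, 6.0, 7.0]))
print(margin_exceeds_noise([10.0, 11.0, 12.0], [9.5, 11.0, 12.5]))
```

A single run per configuration, as described in the response, leaves `summarize_runs` without enough data, which is the gap the referee points to.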
Circularity Check
No significant circularity; empirical results rest on external teacher models and held-out evaluation
Full rationale
The paper describes an empirical knowledge-distillation procedure that trains a student detector on base LVIS categories while aligning its region embeddings to embeddings produced by an external pretrained teacher (CLIP or ALIGN). Reported numbers such as 16.1 mask AP_r and the 3.8 gain over the supervised baseline are measured on held-out rare categories and on transfer datasets (PASCAL VOC, COCO, Objects365). No equations, fitted parameters, or self-citations are shown to reduce these performance figures to the training inputs by construction. The method relies on independently pretrained teacher models and standard detection benchmarks, so the central claims are checked against external references rather than against the paper's own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- distillation loss weights and temperature
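To make the role of these free parameters concrete, the sketch below enumerates a small ablation grid and shows how the temperature changes the sharpness of a softmax over similarity scores. The grid values, the `image_weight` key, and the helper names are assumptions for illustration, not settings from the paper.

```python
import itertools
import math

def ablation_grid(image_weights, temperatures):
    """Enumerate every (image_weight, temperature) combination as a config dict."""
    return [
        {"image_weight": w, "temperature": t}
        for w, t in itertools.product(image_weights, temperatures)
    ]

def softmax(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed grid values, chosen only to illustrate the enumeration.
grid = ablation_grid([0.5, 1.0, 2.0], [0.01, 0.1])
print(len(grid))

# A lower temperature concentrates probability on the highest score.
print(softmax([1.0, 0.0], temperature=1.0))
print(softmax([1.0, 0.0], temperature=0.1))
```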
axioms (1)
- domain assumption: Pretrained vision-language models produce embeddings that generalize to object regions in detection images for categories outside their original training distribution.
Forward citations
Cited by 19 Pith papers
-
Code as Policies: Language Model Programs for Embodied Control
Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
-
Revisiting Shadow Detection from a Vision-Language Perspective
SVL uses language embeddings aligned with global image representations via shadow ratio regression and global-to-local coupling to improve shadow detection robustness in ambiguous cases.
-
DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
DETR-ViP boosts visual-prompted detection performance by learning globally discriminative prompts through integration and distillation on top of image-text contrastive learning, with a selective fusion step for stability.
-
SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
-
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
-
The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection
DAT adapts vision-language models for open-vocabulary detection via self-supervised pseudo-labeling from closed-set detectors, improving performance on novel and known classes while fine-tuning under 0.8M parameters.
-
Diffusion Model as a Generalist Segmentation Learner
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
-
Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests
Granularity-aware distillation improves tree instance segmentation accuracy on real forest images by merging logits and unifying masks from fine-grained synthetic teachers despite coarse real labels.
-
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
-
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.
-
Inner Monologue: Embodied Reasoning through Planning with Language Models
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
-
Language-driven Semantic Segmentation
LSeg achieves competitive zero-shot semantic segmentation by contrastively aligning dense pixel embeddings from a transformer with text embeddings of class labels.
-
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
DeCo-DETR constructs a hierarchical semantic prototype space from LVLM-generated descriptions aligned via CLIP and uses decoupled training streams to separate semantic reasoning from detection, yielding efficient open...
-
OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
OVOD-Agent models visual reasoning as a weakly Markovian decision process with bandit-driven exploration to create a self-evolving open-vocabulary detector that improves on rare categories in COCO and LVIS.
-
Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning
SQI uses axiomatic constraints, hierarchical decomposition, and counterfactual verification to align linguistic reasoning with visual perception in frozen VLMs, achieving second place on the DataCV 2026 illusion challenge.
-
A Multimodal Depth-Aware Method For Embodied Reference Understanding
A depth-aware multimodal ERU framework with LLM data augmentation and a depth-aware decision module outperforms baselines for referent detection on two datasets.