ViLD distills region and text embeddings from a teacher vision-language model into a student detector, enabling open-vocabulary detection that outperforms supervised baselines on held-out rare classes in LVIS and transfers to COCO, VOC, and Objects365.
B A NALYSIS OF CLIP ON CROPPED REGIONS In this section, we analyze some common failure cases of CLIP on cropped regions and discuss possible ways to mitigate these problems
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2021 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
ViLD distills region and text embeddings from a teacher vision-language model into a student detector, enabling open-vocabulary detection that outperforms supervised baselines on held-out rare classes in LVIS and transfers to COCO, VOC, and Objects365.