DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection

Chuang Zhu; Donghong Jiang; Endian Lin; Hanqing Liu; Luoping Cui; Mingjie Liu; Zhao Yang

arxiv: 2605.18023 · v1 · pith:MOH3FBK2new · submitted 2026-05-18 · 💻 cs.CV

Donghong Jiang, Endian Lin, Hanqing Liu, Mingjie Liu, Luoping Cui, Zhao Yang, Chuang Zhu

Pith reviewed 2026-05-20 12:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords: open-vocabulary object detection, fine-grained detection, attribute activation, attribute prefix adapter, key value modulator, contrastive loss, attribute binding, dual-stage framework

The pith

Open-vocabulary detection models bind attributes to the wrong objects when category signals dominate inference, and the DSAA framework corrects this by activating attribute information at two stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that open-vocabulary object detection fails at fine-grained tasks with attributes such as color, material, and texture because category signals overpower and sideline attribute details during inference. This produces mismatched bindings between attributes and objects. The authors introduce the Dual-Stage Attribute Activation framework to strengthen attribute semantics first in the text embedding stage and again during BERT encoding. They add an Attribute Prefix Adapter to inject explicit attribute priors, use a Key/Value Modulator to boost the relevant token vectors, and apply an attribute-aware contrastive loss to sharpen distinctions between same-category items that differ only in attributes. If correct, the approach would raise accuracy on detailed attribute queries for both seen and unseen categories while preserving overall detection performance.
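To make the first stage concrete, here is a minimal sketch of prefix injection. It is not the authors' implementation: it assumes the adapter is a single linear layer followed by tanh and that the generated prefixes are prepended to the token sequence. The function name `attribute_prefix_adapter`, the toy weights, and the two-dimensional embeddings are all hypothetical.

```python
import math

def attribute_prefix_adapter(token_embeddings, attr_positions, weight, bias):
    """Map each attribute token embedding to one prefix vector and prepend it.

    Assumption: the adapter is one linear layer plus tanh. The actual APA
    architecture is not described in this summary.
    """
    prefixes = []
    for i in attr_positions:
        emb = token_embeddings[i]
        # Linear projection: one output value per row of the weight matrix.
        projected = [sum(w * x for w, x in zip(row, emb)) + b
                     for row, b in zip(weight, bias)]
        prefixes.append([math.tanh(p) for p in projected])
    # Prefix tokens come first; the original tokens follow unchanged.
    return prefixes + [list(e) for e in token_embeddings]

# Toy prompt "red car" with made-up 2-d embeddings; "red" (index 0) is the attribute.
tokens = [[1.0, 0.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
extended = attribute_prefix_adapter(tokens, [0], identity, [0.0, 0.0])
print(len(extended))  # one prefix token plus the two original tokens
```

In this design the original token positions shift by the number of prefixes, so any later step that indexes attribute tokens would need to account for that offset.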

Core claim

We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during

What carries the argument

The Dual-Stage Attribute Activation (DSAA) framework, which strengthens attribute semantics by generating explicit attribute prefixes with the Attribute Prefix Adapter in the text embedding stage and selectively enhancing the Key and Value vectors of attribute tokens with the K/V Modulator during BERT encoding, plus an attribute-aware contrastive loss for training discrimination.
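A minimal sketch of the second stage, assuming single-head scaled dot-product attention and constant scale factors applied to the Key and Value rows of attribute tokens. The names `modulated_attention`, `key_scale`, and `value_scale` are hypothetical, and the paper's actual modulation rule may differ.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def modulated_attention(queries, keys, values, attr_positions,
                        key_scale=1.0, value_scale=1.0):
    """Scaled dot-product attention with K/V rows of attribute tokens rescaled.

    Assumption: modulation is a plain multiplicative scale (hypothetical form).
    Returns (outputs, attention_weights).
    """
    attr = set(attr_positions)
    keys_mod = [[key_scale * x for x in k] if i in attr else list(k)
                for i, k in enumerate(keys)]
    values_mod = [[value_scale * x for x in v] if i in attr else list(v)
                  for i, v in enumerate(values)]
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        logits = [dot(q, k) / math.sqrt(d) for k in keys_mod]
        w = softmax(logits)
        out = [sum(w[j] * values_mod[j][c] for j in range(len(values_mod)))
               for c in range(len(values_mod[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy prompt "red car": made-up embeddings, identity Q/K/V projections assumed.
emb = [[1.0, 0.0], [1.0, 1.0]]
_, base_w = modulated_attention(emb, emb, emb, attr_positions=[0])
_, mod_w = modulated_attention(emb, emb, emb, attr_positions=[0], key_scale=2.0)
print(base_w[1][0], mod_w[1][0])  # attention from "car" to "red", before and after
```

In this toy case, doubling the Key of "red" raises the attention that "car" pays to it from about 0.33 to 0.50. Scaling a Key only raises attention when the query-key dot product is positive, so the effect of the scale depends on the sign of that product.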

If this is right

Strengthens attribute semantics at the text embedding and BERT encoding stages to improve fine-grained detection.
Enables better discrimination among same-category instances that differ only in attributes.
Applies to various mainstream open-vocabulary detection models without changing their core category detection.
Raises performance on tasks that require identifying specific colors, materials, and textures in unseen categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage activation pattern may help other vision-language models that suffer from weak attribute grounding in captioning or visual question answering.
Extending the modulator to additional token types could address fine-grained signals beyond attributes, such as spatial relations or actions.
Testing the framework on datasets with rarer attribute combinations would reveal whether the binding correction scales to long-tail cases.

Load-bearing premise

The performance bottleneck arises specifically because category signals dominate and marginalize attribute information during inference, producing incorrect attribute-object bindings that the APA module, K/V Modulator, and contrastive loss can fix without offsetting errors.

What would settle it

Running the DSAA modules on the FG-OVD benchmark and finding no gain in fine-grained attribute detection accuracy or a drop in standard category-level performance would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.18023 by Chuang Zhu, Donghong Jiang, Endian Lin, Hanqing Liu, Luoping Cui, Mingjie Liu, Zhao Yang.

**Figure 1.** Motivating example comparing Grounding DINO and DSAA on attribute-sensitive prompts. The baseline model (left) confuses attribute semantics, assigning high confidence to invalid compositions such as “a green dog”, while DSAA (right) correctly rejects mismatched queries and preserves consistent attribute–object binding.

**Figure 2.** Inference pipeline with the proposed Dual-Stage Attribute Activation (DSAA). DSAA activates fine-grained attribute semantics in two stages: (1) An Attribute Prefix Adapter injects explicit attribute priors into the text embedding space, and (2) a K/V Modulator selectively amplifies attribute tokens within early text encoder layers.

**Figure 3.** Overview of the proposed Dual-Stage Attribute Activation (DSAA) framework. The overall workflow consists of three main components: (1) Attribute Words Extraction: an LLM identifies attribute words and their token positions from input text; (2) Attribute Prefix Insertion: extracted attributes are converted by Attribute Prefix Adapter into prefix tokens and inserted into text embeddings as attribute semantic…

**Figure 4.** Qualitative results of Grounding DINO with and without DSAA on the FG-OVD benchmark. Each row presents detection results under attribute-rich text queries. The top row shows predictions from the baseline, and the bottom row shows those from DSAA. Green/blue boxes denote positive captions (correct attribute–object matches), while red/orange boxes denote negative captions (incorrect or mismatched composition…

**Figure 5.** t-SNE visualization of attribute embeddings on Grounding DINO. DSAA forms more compact and semantically consistent clusters across attributes, demonstrating its effectiveness in improving the separation and coherence of attribute representations compared to the baseline.

**Figure 6.** Feature distance distribution. DSAA increases the separation between positive and negative samples by 30.2%, demonstrating enhanced feature discriminability.

Original abstract

Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.
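The abstract does not give the form of the attribute-aware contrastive loss. As a rough sketch only, the block below uses a standard InfoNCE-style loss in which the negatives are captions of the same category with a different attribute. The function names, the temperature value, and the toy vectors are assumptions, not details from the paper.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def attribute_contrastive_loss(region, positive, hard_negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not the paper's exact loss).

    region:         visual feature of the target object
    positive:       text feature of the caption with the correct attribute
    hard_negatives: text features of same-category captions with wrong attributes
    """
    pos = cosine(region, positive) / temperature
    negs = [cosine(region, n) / temperature for n in hard_negatives]
    logits = [pos] + negs
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos

# Toy features: the region looks "red"; "red car" is positive, "blue car" negative.
region = [1.0, 0.0]
red_car = [0.9, 0.1]
blue_car = [0.1, 0.9]
aligned = attribute_contrastive_loss(region, red_car, [blue_car])
swapped = attribute_contrastive_loss(region, blue_car, [red_car])
print(aligned < swapped)
```

The loss is small when the region feature sits closer to the correct-attribute caption than to its same-category negatives, and it grows when the attribute is swapped, which is the discrimination behavior the abstract describes.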

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DSAA gives a practical way to boost attributes in OVD through prefix adapters and modulators, but the evidence tying gains to the specific category-dominance mechanism is still thin.

The paper introduces DSAA, a framework that activates attributes at two stages in open-vocabulary detection: first with an Attribute Prefix Adapter that adds explicit attribute information to the text embeddings, and second with a Key/Value Modulator that enhances the relevant tokens in the BERT encoder. They pair this with an attribute-aware contrastive loss.

This is a legitimate new technique for handling attributes inside transformer-based OVD systems. It extends prior work on text encoders by making attribute semantics stronger at specific points in the pipeline. The focus on fine-grained attributes like color, material, and texture fills a recognized gap, and integrating the modules without major changes to the base model is a practical strength. The experiments are said to show gains across mainstream models on the FG-OVD benchmark, which is a good sign that the method has broad applicability.

The main concern is that the causal story is not fully backed up. The authors attribute the bottleneck to category signals dominating and marginalizing attribute information, but there is no mention of diagnostics such as attention weight comparisons between attribute and category tokens or visualizations of binding failures. If the performance boost comes from generic additions rather than correcting that specific marginalization, then the claim about avoiding offsetting errors on category performance needs more checking. The abstract is high-level on the results, so full numbers and ablations would help clarify this.

This paper is for computer vision researchers interested in open-vocabulary and attribute-aware detection. Someone looking for ways to improve detailed object recognition in VL models could find the adapter and modulator designs worth trying. It shows clear thinking on the problem and proposes a focused solution. The work deserves a serious referee to evaluate the empirical support and any potential limitations in the setup.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Dual-Stage Attribute Activation (DSAA) framework to improve fine-grained open-vocabulary object detection (OVD). It attributes the performance bottleneck to category signals dominating and marginalizing attribute information during inference, which causes incorrect attribute-object binding. DSAA addresses this via an Attribute Prefix Adapter (APA) module that injects explicit attribute priors in the text embedding stage, a Key/Value (K/V) Modulator that selectively enhances attribute token vectors during BERT encoding, and an attribute-aware contrastive loss to improve discrimination among same-category instances with differing attributes. Experiments on the FG-OVD benchmark are reported to demonstrate effectiveness across mainstream OVD models.

Significance. If the empirical gains on FG-OVD hold and arise specifically from improved attribute binding rather than generic capacity or supervision effects, the work could offer a practical, targeted enhancement for fine-grained OVD. The dual-stage design and attribute-aware loss provide concrete architectural interventions, and the explicit introduction of APA and K/V Modulator modules supplies a reproducible recipe for strengthening attribute semantics.

major comments (2)

[Introduction] Introduction (core premise paragraph): The claim that category signals dominate and marginalize attribute information (leading to incorrect binding) is presented as the central bottleneck but lacks direct supporting diagnostics such as attention weight comparisons between attribute and category tokens, embedding cosine distances, or failure-case visualizations conditioned on the presence of both signals. Without this evidence, it remains unclear whether the proposed APA and K/V Modulator specifically correct the hypothesized mechanism or whether gains arise from other factors.
[Method] Method section (DSAA framework and K/V Modulator description): No ablation or analysis is provided showing that the modulation of Key/Value vectors for attribute tokens restores correct binding without offsetting degradation on category-level detection. If the modules or contrastive loss term affect overall performance, the claim that DSAA strengthens attribute semantics at the two critical stages without trade-offs requires targeted validation (e.g., separate category-only and attribute-only metrics).
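One of the diagnostics requested above can be sketched as follows: given a text-encoder attention map, compare the attention mass received by attribute tokens with that received by category tokens. The attention matrix below is made up for illustration and the helper names are hypothetical; in practice the weights would come from the model's text encoder.

```python
def attention_mass(weights, positions):
    """Mean, over query rows, of the total attention placed on the given key positions."""
    return sum(sum(row[j] for j in positions) for row in weights) / len(weights)

def attribute_category_ratio(weights, attr_positions, category_positions):
    """Ratio of attention received by attribute tokens to that received by category tokens."""
    return (attention_mass(weights, attr_positions)
            / attention_mass(weights, category_positions))

# Hypothetical attention map for the prompt "a red car" (rows: queries, columns: keys).
# Each row sums to 1; the values are invented, not measured.
weights = [
    [0.2, 0.2, 0.6],
    [0.1, 0.3, 0.6],
    [0.1, 0.1, 0.8],
]
ratio = attribute_category_ratio(weights, attr_positions=[1], category_positions=[2])
print(round(ratio, 3))
```

A ratio well below 1 for a baseline that rises after modulation would be consistent with the category-dominance hypothesis, though it would not establish the mechanism on its own.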

minor comments (2)

[Abstract] Abstract: The statement that results 'demonstrate the effectiveness of our method across various mainstream open-vocabulary models' would be strengthened by naming the specific models and reporting at least one key quantitative delta (e.g., mAP improvement on FG-OVD).
[Method] Notation and equations: The definition of the modulation scale and how it is applied to the Key and Value vectors in the K/V Modulator should be given explicitly (e.g., as an equation) to allow exact reproduction.
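As an illustration of what such an explicit definition might look like, one possible form is given below. It assumes a constant scale per vector type and a binary attribute mask; the symbols $\alpha_K$, $\alpha_V$, and $m_i$ are illustrative and are not taken from the paper.

```latex
\tilde{K}_i = \bigl(1 + (\alpha_K - 1)\, m_i\bigr)\, K_i, \qquad
\tilde{V}_i = \bigl(1 + (\alpha_V - 1)\, m_i\bigr)\, V_i, \qquad
m_i = \begin{cases} 1 & \text{if token } i \text{ is an attribute token,} \\ 0 & \text{otherwise.} \end{cases}
```

Under this form, setting both scales to 1 recovers the unmodified encoder, and non-attribute tokens are never changed.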

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying our approach and committing to targeted revisions where appropriate to strengthen the manuscript.

Referee: [Introduction] Introduction (core premise paragraph): The claim that category signals dominate and marginalize attribute information (leading to incorrect binding) is presented as the central bottleneck but lacks direct supporting diagnostics such as attention weight comparisons between attribute and category tokens, embedding cosine distances, or failure-case visualizations conditioned on the presence of both signals. Without this evidence, it remains unclear whether the proposed APA and K/V Modulator specifically correct the hypothesized mechanism or whether gains arise from other factors.

Authors: We acknowledge that the introduction relies on the observed performance gains on FG-OVD to support the category-dominance hypothesis rather than providing explicit diagnostics such as attention maps or cosine similarities. While these gains across multiple OVD backbones are consistent with improved attribute binding, we agree that direct evidence would more rigorously isolate the mechanism. In the revised manuscript we will add attention weight comparisons between attribute and category tokens along with conditioned failure-case visualizations to demonstrate the binding issue in baselines and its alleviation by DSAA. revision: yes
Referee: [Method] Method section (DSAA framework and K/V Modulator description): No ablation or analysis is provided showing that the modulation of Key/Value vectors for attribute tokens restores correct binding without offsetting degradation on category-level detection. If the modules or contrastive loss term affect overall performance, the claim that DSAA strengthens attribute semantics at the two critical stages without trade-offs requires targeted validation (e.g., separate category-only and attribute-only metrics).

Authors: We note that our reported results show net improvements on fine-grained metrics without degradation on standard OVD benchmarks, indicating that the K/V modulator and contrastive loss do not introduce obvious category-level trade-offs. Nevertheless, to supply the requested targeted validation we will include new ablations that separately report category-only and attribute-specific metrics. These additions will confirm that attribute semantics are strengthened at both stages without compromising category detection performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in DSAA framework derivation

The paper proposes new modules (Attribute Prefix Adapter, K/V Modulator) and an attribute-aware contrastive loss to strengthen attribute semantics in OVD models. These are presented as architectural interventions with empirical validation on FG-OVD benchmark; no equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central premise on category dominance is an interpretive attribution rather than a tautological derivation, and the method remains self-contained against external benchmarks without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests primarily on the domain assumption that category dominance is the root cause of poor attribute binding and that the two new modules plus contrastive loss will remedy it. No explicit free parameters are named in the abstract, though typical training hyperparameters for the modules and loss weight are expected. The new modules themselves are invented components without independent falsifiable evidence outside the proposed experiments.

free parameters (1)

Module-specific hyperparameters (prefix length, modulation scale, loss weight)
Standard tunable values in adapter and modulation designs; not enumerated in the abstract but required for training the proposed components.

axioms (1)

domain assumption: Category signals dominate and marginalize attribute information during OVD inference, producing incorrect attribute-object binding
Explicitly stated in the abstract as the core issue to which the performance bottleneck is attributed.

invented entities (2)

Attribute Prefix Adapter (APA): no independent evidence
purpose: Generate attribute prefixes that inject explicit attribute priors into text embeddings
New module introduced for the text embedding stage.

Key/Value (K/V) Modulator: no independent evidence
purpose: Selectively enhance Key and Value vectors of attribute tokens during BERT encoding
New intervention module for the encoding phase.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. ... Attribute Prefix Adapter (APA) ... Key/Value (K/V) Modulator ... attribute-aware contrastive loss

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.

DSAA ... non-invasive framework that mitigates attribute marginalization ... +20.5 mAP on Grounding DINO

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors

[1]

The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding

Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi. The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22520–22529, 2024. 2, 5, 6

work page 2024
[2]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xing- gang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16901–16911, 2024. 1, 2

work page 2024
[3]

Open vocabulary object search utilizing large language models and fuzzy inferencing

Akash Chikhalikar, Ankit A Ravankar, Jose Victorio Salazar Luces, and Yasuhisa Hirata. Open vocabulary object search utilizing large language models and fuzzy inferencing. In 2025 IEEE/SICE International Symposium on System Inte- gration (SII), pages 345–351. IEEE, 2025. 1

work page 2025
[4]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 2

work page 2019
[5]

Lami-detr: Open-vocabulary detection with language model instruction

Penghui Du, Yu Wang, Yifan Sun, Luting Wang, Yue Liao, Gang Zhang, Errui Ding, Yan Wang, Jingdong Wang, and Si Liu. Lami-detr: Open-vocabulary detection with language model instruction. InEuropean Conference on Computer Vision (ECCV), pages 312–328. Springer, 2024. 2

work page 2024
[6]

Prompt- det: Towards open-vocabulary detection using uncurated im- ages

Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. Prompt- det: Towards open-vocabulary detection using uncurated im- ages. InEuropean Conference on Computer Vision (ECCV), pages 701–717, 2022. 1

work page 2022
[7]

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation.arXiv preprint arXiv:2104.13921,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Lvis: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5356–5364, 2019. 1

work page 2019
[9]

V oxposer: Composable 3d value maps for robotic manipulation with language models

Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models. InPro- ceedings of The 7th Conference on Robot Learning, pages 540–562. PMLR, 2023. 1

work page 2023
[10]

On the potential of open-vocabulary models for object detection in unusual street scenes

Sadia Ilyas, Ido Freeman, and Matthias Rottmann. On the potential of open-vocabulary models for object detection in unusual street scenes. InEuropean Conference on Computer Vision, pages 201–217. Springer, 2024. 1

work page 2024
[11]

Object- centric open-vocabulary image-retrieval with aggregated fea- tures.arXiv preprint arXiv:2309.14999, 2023

Hila Levi, Guy Heller, Dan Levi, and Ethan Fetaya. Object- centric open-vocabulary image-retrieval with aggregated fea- tures.arXiv preprint arXiv:2309.14999, 2023. 1

work page arXiv 2023
[12]

Desco: Learning object recognition with rich language de- scriptions.NeurIPS, 36:37511–37526, 2023

Liunian Li, Zi-Yi Dou, Nanyun Peng, and Kai-Wei Chang. Desco: Learning object recognition with rich language de- scriptions.NeurIPS, 36:37511–37526, 2023. 2

work page 2023
[13]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jian- wei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, and Jenq-Neng Hwang. Grounded language-image pre-training. InCVPR, pages 10965–10975,

work page
[14]

Cliff: Continual latent diffusion for open-vocabulary object detec- tion

Wuyang Li, Xinyu Liu, Jiayi Ma, and Yixuan Yuan. Cliff: Continual latent diffusion for open-vocabulary object detec- tion. InECCV, pages 255–273, 2024. 1

work page 2024
[15]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 9493–9500. IEEE, 2023. 1

work page 2023
[16]

Mi- crosoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. Mi- crosoft coco: Common objects in context. InECCV, pages 740–755, 2014. 1

work page 2014
[17]

Shine: Semantic hierarchy nexus for open-vocabulary object detection

Mingxuan Liu, Tyler L Hayes, Elisa Ricci, Gabriela Csurka, and Riccardo V olpi. Shine: Semantic hierarchy nexus for open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16634–16644, 2024. 2

work page 2024
[18]

Objectfinder: Open-vocabulary assistive sys- tem for interactive object search by blind people

Ruiping Liu, Jiaming Zhang, Angela Sch ¨on, Karin M ¨uller, Junwei Zheng, Kailun Yang, Anhong Guo, Kathrin Gerling, and Rainer Stiefelhagen. Objectfinder: An open-vocabulary assistive system for interactive object search by blind people. arXiv preprint arXiv:2412.03118, 2024. 1

work page arXiv 2024
[19]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InECCV, pages 38–55, 2024. 1, 2

work page 2024
[20]

Ha- fgovd: Highlighting fine-grained attributes via explicit lin- ear composition for open-vocabulary object detection.IEEE Transactions on Multimedia (TMM), 2025

Yuqi Ma, Mengyin Liu, Chao Zhu, and Xu-Cheng Yin. Ha- fgovd: Highlighting fine-grained attributes via explicit lin- ear composition for open-vocabulary object detection.IEEE Transactions on Multimedia (TMM), 2025. 2

work page 2025
[21]

Rethinking open-world object detection in autonomous driving scenarios

Zeyu Ma, Yang Yang, Guoqing Wang, Xing Xu, Heng Tao Shen, and Mingxing Zhang. Rethinking open-world object detection in autonomous driving scenarios. InACM MM, pages 1279–1288, 2022. 1

work page 2022
[22]

Simple open-vocabulary object detection

Matthias Minderer, Alexey Gritsenko, Austin Stone, et al. Simple open-vocabulary object detection. InECCV, pages 728–755, 2022. 2

work page 2022
[23]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2

work page 2021
[24]

You only look once: Unified, real-time object de- tection

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object de- tection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016. 1

work page 2016
[25]

Faster r-cnn: Towards real-time object detection with region proposal networks.IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2017

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2017. 1

work page 2017
[26]

Eda-detr: Open-vocabulary ob- ject detection using early dense alignment

Cheng Shi and Sibei Yang. Eda-detr: Open-vocabulary ob- ject detection using early dense alignment. InICCV, pages 15724–15734, 2023. 2

work page 2023
[27]

Open-vocabulary part-based grasping.arXiv preprint arXiv:2406.05951, 2024

Tjeard van Oort, Dimity Miller, Will N Browne, Nico- las Marticorena, Jesse Haviland, and Niko Suenderhauf. Open-vocabulary part-based grasping.arXiv preprint arXiv:2406.05951, 2024. 1

work page arXiv 2024
[28]

Openad: Open-world au- tonomous driving benchmark for 3d object detection.arXiv preprint arXiv:2411.17761, 2024

Zhongyu Xia, Jishuo Li, Zhiwei Lin, Xinhao Wang, Yong- tao Wang, and Ming-Hsuan Yang. Openad: Open-world au- tonomous driving benchmark for 3d object detection.arXiv preprint arXiv:2411.17761, 2024. 1

work page arXiv 2024
[29]

A. Yang, A. Li, B. Yang, B. Zhang, and et al. Qwen3 techni- cal report.arXiv preprint arXiv:2505.09388, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment

Lewei Yao, Jianhua Han, Xiaodan Liang, et al. Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment. InCVPR, pages 23497–23506,

work page
[31]

Detclipv3: To- wards versatile generative open-vocabulary object detection

Lewei Yao, Renjie Pi, Jianhua Han, et al. Detclipv3: To- wards versatile generative open-vocabulary object detection. InCVPR, pages 27391–27401, 2024. 2

work page 2024
[32]

Open-vocabulary detr with conditional matching

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. InECCV, pages 106–122. Springer, 2022. 2

work page 2022
[33]

Glipv2: Unifying localization and vision-language under- standing

Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, et al. Glipv2: Unifying localization and vision-language under- standing. InNeurIPS, pages 36067–36080, 2022. 2

work page 2022
[34]

Ovgrasp: Target-oriented open-vocabulary robotic grasping in clutter.Robotics and Autonomous Systems, page 105210, 2025

Xiaomei Zhang, Hanyue Ling, Xiao Huang, Qiwen Jin, and Jiwei Hu. Ovgrasp: Target-oriented open-vocabulary robotic grasping in clutter.Robotics and Autonomous Systems, page 105210, 2025. 1

work page 2025
[35]

Region- clip: Region-based language-image pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, et al. Region- clip: Region-based language-image pretraining. InCVPR, pages 16793–16803, 2022. 2

work page 2022
[36]

Deformable DETR: Deformable Transformers for End-to-End Object Detection

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable trans- formers for end-to-end object detection.arXiv preprint arXiv:2010.04159, 2020. 1
