
arxiv: 2605.18023 · v1 · pith:MOH3FBK2new · submitted 2026-05-18 · 💻 cs.CV

DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection

Pith reviewed 2026-05-20 12:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary object detection · fine-grained detection · attribute activation · attribute prefix adapter · key value modulator · contrastive loss · attribute binding · dual-stage framework

The pith

Open-vocabulary detection models bind attributes to the wrong objects when category signals dominate inference, and the DSAA framework corrects this by activating attribute information at two stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that open-vocabulary object detection fails at fine-grained tasks with attributes such as color, material, and texture because category signals overpower and sideline attribute details during inference. This produces mismatched bindings between attributes and objects. The authors introduce the Dual-Stage Attribute Activation framework to strengthen attribute semantics first in the text embedding stage and again during BERT encoding. They add an Attribute Prefix Adapter to inject explicit attribute priors, use a Key/Value Modulator to boost the relevant token vectors, and apply an attribute-aware contrastive loss to sharpen distinctions between same-category items that differ only in attributes. If correct, the approach would raise accuracy on detailed attribute queries for both seen and unseen categories while preserving overall detection performance.

Core claim

We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during

What carries the argument

The Dual-Stage Attribute Activation (DSAA) framework, which strengthens attribute semantics by generating explicit attribute prefixes with the Attribute Prefix Adapter in the text embedding stage and selectively enhancing the Key and Value vectors of attribute tokens with the K/V Modulator during BERT encoding, plus an attribute-aware contrastive loss for training discrimination.
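
A minimal NumPy sketch of how the two stages could fit together is given here for orientation only. It is not the authors' implementation. The mean pooling of attribute embeddings, the projection `W_prefix`, the prefix count `n_prefix`, the multiplicative form `1 + alpha`, the value of `alpha`, the single attention head, and the choice to leave prefix tokens unmodulated are all assumptions made for illustration; none of them is specified in this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attribute_prefix(token_emb, attr_mask, W_prefix, n_prefix):
    """Stage 1 (hypothetical APA): turn attribute embeddings into prefix tokens.

    Assumes at least one attribute token is marked in attr_mask.
    """
    d = token_emb.shape[1]
    attr_mean = token_emb[attr_mask].mean(axis=0)          # (d,)
    prefix = (W_prefix @ attr_mean).reshape(n_prefix, d)   # (n_prefix, d)
    return np.concatenate([prefix, token_emb], axis=0)     # prefixes go first

def modulated_self_attention(x, attr_mask, Wq, Wk, Wv, alpha):
    """Stage 2 (hypothetical K/V Modulator): scale K and V rows of attribute tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scale = 1.0 + alpha * attr_mask.astype(x.dtype)        # 1 for non-attribute tokens
    k = k * scale[:, None]
    v = v * scale[:, None]
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return attn @ v, attn

# Toy prompt "a red wooden chair" with random embeddings (illustrative values only).
rng = np.random.default_rng(0)
d, n_prefix = 8, 2
emb = rng.normal(size=(4, d))
attr_mask = np.array([False, True, True, False])           # "red", "wooden"
W_prefix = rng.normal(size=(n_prefix * d, d)) * 0.1
x = attribute_prefix(emb, attr_mask, W_prefix, n_prefix)

# Prefix tokens are not modulated in this sketch (an assumption).
full_mask = np.concatenate([np.zeros(n_prefix, dtype=bool), attr_mask])
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out, attn = modulated_self_attention(x, full_mask, Wq, Wk, Wv, alpha=0.5)
```

With `alpha = 0` the second function reduces to ordinary single-head self-attention, which makes it a convenient baseline when checking what the modulation changes.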

If this is right

  • Strengthens attribute semantics at the text embedding and BERT encoding stages to improve fine-grained detection.
  • Enables better discrimination among same-category instances that differ only in attributes.
  • Applies to various mainstream open-vocabulary detection models without changing their core category detection.
  • Raises performance on tasks that require identifying specific colors, materials, and textures in unseen categories.
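
The discrimination among same-category instances mentioned above is the job of the attribute-aware contrastive loss. The summary does not give its formula, so the sketch below swaps in a standard InfoNCE-style loss as a stand-in: a region feature is pulled toward the caption with the correct attribute and pushed away from captions of the same category with a different attribute. The function name, the cosine similarity, and the temperature of 0.07 are assumptions, and the vectors are toy values.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def attribute_contrastive_loss(region, pos_text, neg_texts, temperature=0.07):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    region:    (d,)   visual feature of one object
    pos_text:  (d,)   embedding of the caption with the correct attribute
    neg_texts: (m, d) embeddings of same-category captions with other attributes
    """
    region = l2_normalize(region)
    pos_text = l2_normalize(pos_text)
    neg_texts = l2_normalize(neg_texts)
    pos_logit = region @ pos_text / temperature
    neg_logits = neg_texts @ region / temperature
    logits = np.concatenate([[pos_logit], neg_logits])
    logits = logits - logits.max()                     # numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

# Toy check: the loss is small when the region matches the positive caption
# and large when it matches a hard negative instead.
region = np.array([1.0, 0.0])
red_chair = np.array([1.0, 0.0])     # stands in for the correct-attribute caption
blue_chair = np.array([[0.0, 1.0]])  # stands in for a same-category negative
loss_good = attribute_contrastive_loss(region, red_chair, blue_chair)
loss_bad = attribute_contrastive_loss(region, blue_chair[0], red_chair[None, :])
```

Restricting the negatives to same-category captions is what makes the loss attribute-aware in this sketch: the category cue is identical across the candidates, so only the attribute can separate them.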

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage activation pattern may help other vision-language models that suffer from weak attribute grounding in captioning or visual question answering.
  • Extending the modulator to additional token types could address fine-grained signals beyond attributes, such as spatial relations or actions.
  • Testing the framework on datasets with rarer attribute combinations would reveal whether the binding correction scales to long-tail cases.

Load-bearing premise

The performance bottleneck arises specifically because category signals dominate and marginalize attribute information during inference, producing incorrect attribute-object bindings that the APA module, K/V Modulator, and contrastive loss can fix without offsetting errors.

What would settle it

Running the DSAA modules on the FG-OVD benchmark and finding no gain in fine-grained attribute detection accuracy or a drop in standard category-level performance would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.18023 by Chuang Zhu, Donghong Jiang, Endian Lin, Hanqing Liu, Luoping Cui, Mingjie Liu, Zhao Yang.

Figure 1. Motivating example comparing Grounding DINO and DSAA on attribute-sensitive prompts. The baseline model (left) confuses attribute semantics, assigning high confidence to invalid compositions such as “a green dog”, while DSAA (right) correctly rejects mismatched queries and preserves consistent attribute–object binding.
Figure 2. Inference pipeline with the proposed Dual-Stage Attribute Activation (DSAA). DSAA activates fine-grained attribute semantics in two stages: (1) An Attribute Prefix Adapter injects explicit attribute priors into the text embedding space, and (2) a K/V Modulator selectively amplifies attribute tokens within early text encoder layers.
Figure 3. Overview of the proposed Dual-Stage Attribute Activation (DSAA) framework. The overall workflow consists of three main components: (1) Attribute Words Extraction: an LLM identifies attribute words and their token positions from input text; (2) Attribute Prefix Insertion: extracted attributes are converted by Attribute Prefix Adapter into prefix tokens and inserted into text embeddings as attribute semantic…
Figure 4. Qualitative results of Grounding DINO with and without DSAA on the FG-OVD benchmark. Each row presents detection results under attribute-rich text queries. The top row shows predictions from the baseline, and the bottom row shows those from DSAA. Green/blue boxes denote positive captions (correct attribute–object matches), while red/orange boxes denote negative captions (incorrect or mismatched composition…
Figure 5. t-SNE visualization of attribute embeddings on Grounding DINO. DSAA forms more compact and semantically consistent clusters across attributes, demonstrating its effectiveness in improving the separation and coherence of attribute representations compared to the baseline.
Figure 6. Feature distance distribution. DSAA increases the separation between positive and negative samples by 30.2%, demonstrating enhanced feature discriminability.
Original abstract

Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Dual-Stage Attribute Activation (DSAA) framework to improve fine-grained open-vocabulary object detection (OVD). It attributes the performance bottleneck to category signals dominating and marginalizing attribute information during inference, which causes incorrect attribute-object binding. DSAA addresses this via an Attribute Prefix Adapter (APA) module that injects explicit attribute priors in the text embedding stage, a Key/Value (K/V) Modulator that selectively enhances attribute token vectors during BERT encoding, and an attribute-aware contrastive loss to improve discrimination among same-category instances with differing attributes. Experiments on the FG-OVD benchmark are reported to demonstrate effectiveness across mainstream OVD models.

Significance. If the empirical gains on FG-OVD hold and arise specifically from improved attribute binding rather than generic capacity or supervision effects, the work could offer a practical, targeted enhancement for fine-grained OVD. The dual-stage design and attribute-aware loss provide concrete architectural interventions, and the explicit introduction of APA and K/V Modulator modules supplies a reproducible recipe for strengthening attribute semantics.

major comments (2)
  1. [Introduction] Introduction (core premise paragraph): The claim that category signals dominate and marginalize attribute information (leading to incorrect binding) is presented as the central bottleneck but lacks direct supporting diagnostics such as attention weight comparisons between attribute and category tokens, embedding cosine distances, or failure-case visualizations conditioned on the presence of both signals. Without this evidence, it remains unclear whether the proposed APA and K/V Modulator specifically correct the hypothesized mechanism or whether gains arise from other factors.
  2. [Method] Method section (DSAA framework and K/V Modulator description): No ablation or analysis is provided showing that the modulation of Key/Value vectors for attribute tokens restores correct binding without offsetting degradation on category-level detection. If the modules or contrastive loss term affect overall performance, the claim that DSAA strengthens attribute semantics at the two critical stages without trade-offs requires targeted validation (e.g., separate category-only and attribute-only metrics).
minor comments (2)
  1. [Abstract] Abstract: The statement that results 'demonstrate the effectiveness of our method across various mainstream open-vocabulary models' would be strengthened by naming the specific models and reporting at least one key quantitative delta (e.g., mAP improvement on FG-OVD).
  2. [Method] Notation and equations: The definition of the modulation scale and how it is applied to the Key and Value vectors in the K/V Modulator should be given explicitly (e.g., as an equation) to allow exact reproduction.
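
For concreteness, one form such an equation could take is written out here. It is an assumed form offered for illustration, not the paper's definition: $m_i \in \{0, 1\}$ marks whether token $i$ is an attribute token, $\alpha \ge 0$ is a hypothetical modulation scale, and $d_k$ is the key dimension.

```latex
\tilde{K}_i = (1 + \alpha\, m_i)\, K_i, \qquad
\tilde{V}_i = (1 + \alpha\, m_i)\, V_i, \qquad
\mathrm{Attn}(Q, \tilde{K}, \tilde{V})
  = \mathrm{softmax}\!\left(\frac{Q \tilde{K}^{\top}}{\sqrt{d_k}}\right) \tilde{V}
```

Under this form, $\alpha = 0$ recovers standard scaled dot-product attention. The paper may instead use an additive, learned, or per-layer scale, which is exactly why the explicit definition matters for reproduction.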

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying our approach and committing to targeted revisions where appropriate to strengthen the manuscript.

  1. Referee: [Introduction] Introduction (core premise paragraph): The claim that category signals dominate and marginalize attribute information (leading to incorrect binding) is presented as the central bottleneck but lacks direct supporting diagnostics such as attention weight comparisons between attribute and category tokens, embedding cosine distances, or failure-case visualizations conditioned on the presence of both signals. Without this evidence, it remains unclear whether the proposed APA and K/V Modulator specifically correct the hypothesized mechanism or whether gains arise from other factors.

    Authors: We acknowledge that the introduction relies on the observed performance gains on FG-OVD to support the category-dominance hypothesis rather than providing explicit diagnostics such as attention maps or cosine similarities. While these gains across multiple OVD backbones are consistent with improved attribute binding, we agree that direct evidence would more rigorously isolate the mechanism. In the revised manuscript we will add attention weight comparisons between attribute and category tokens along with conditioned failure-case visualizations to demonstrate the binding issue in baselines and its alleviation by DSAA. revision: yes

  2. Referee: [Method] Method section (DSAA framework and K/V Modulator description): No ablation or analysis is provided showing that the modulation of Key/Value vectors for attribute tokens restores correct binding without offsetting degradation on category-level detection. If the modules or contrastive loss term affect overall performance, the claim that DSAA strengthens attribute semantics at the two critical stages without trade-offs requires targeted validation (e.g., separate category-only and attribute-only metrics).

    Authors: We note that our reported results show net improvements on fine-grained metrics without degradation on standard OVD benchmarks, indicating that the K/V modulator and contrastive loss do not introduce obvious category-level trade-offs. Nevertheless, to supply the requested targeted validation we will include new ablations that separately report category-only and attribute-specific metrics. These additions will confirm that attribute semantics are strengthened at both stages without compromising category detection performance. revision: yes
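
The attention-weight comparison promised in the first response could be computed along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, the attention matrix is assumed to be row-stochastic and taken from one head of one text-encoder layer, and the numbers are invented toy values, not measurements from the paper.

```python
import numpy as np

def attention_mass(attn, attr_idx, cat_idx):
    """Average attention received by attribute tokens versus category tokens.

    attn: (n_query, n_key) matrix whose rows sum to 1.
    """
    received = attn.mean(axis=0)  # mean attention each key token receives
    return {
        "attribute": float(received[attr_idx].sum()),
        "category": float(received[cat_idx].sum()),
    }

# Toy prompt "a red chair": token 1 is the attribute, token 2 the category.
# The matrix is made up to mimic a category-dominated attention pattern.
attn = np.array([
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
])
mass = attention_mass(attn, attr_idx=[1], cat_idx=[2])
ratio = mass["attribute"] / mass["category"]
```

Reporting this ratio for the baseline and for DSAA on the same prompts would show directly whether attribute tokens gain attention mass, which is the mechanism the referee asks the authors to isolate.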

Circularity Check

0 steps flagged

No significant circularity in DSAA framework derivation

The paper proposes new modules (Attribute Prefix Adapter, K/V Modulator) and an attribute-aware contrastive loss to strengthen attribute semantics in OVD models. These are presented as architectural interventions with empirical validation on FG-OVD benchmark; no equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central premise on category dominance is an interpretive attribution rather than a tautological derivation, and the method remains self-contained against external benchmarks without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests primarily on the domain assumption that category dominance is the root cause of poor attribute binding and that the two new modules plus contrastive loss will remedy it. No explicit free parameters are named in the abstract, though typical training hyperparameters for the modules and loss weight are expected. The new modules themselves are invented components without independent falsifiable evidence outside the proposed experiments.

free parameters (1)
  • Module-specific hyperparameters (prefix length, modulation scale, loss weight)
    Standard tunable values in adapter and modulation designs; not enumerated in the abstract but required for training the proposed components.
axioms (1)
  • domain assumption · Category signals dominate and marginalize attribute information during OVD inference, producing incorrect attribute-object binding
    Explicitly stated in the abstract as the core issue to which the performance bottleneck is attributed.
invented entities (2)
  • Attribute Prefix Adapter (APA) · no independent evidence
    purpose: Generate attribute prefixes that inject explicit attribute priors into text embeddings
    New module introduced for the text embedding stage.
  • Key/Value (K/V) Modulator · no independent evidence
    purpose: Selectively enhance Key and Value vectors of attribute tokens during BERT encoding
    New intervention module for the encoding phase.



