arxiv: 2604.02753 · v3 · submitted 2026-04-03 · 💻 cs.CV


DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun


Pith reviewed 2026-05-14 21:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary object detection · DETR · decoupled cognition · semantic prototype space · zero-shot detection · inference efficiency · LVLM · CLIP


The pith

DeCo-DETR decouples semantic understanding from localization for faster open-vocabulary detection

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeCo-DETR as a way to make open-vocabulary object detection practical by addressing both speed and accuracy trade-offs. It builds a reusable hierarchical semantic prototype space in advance using descriptions from pre-trained vision-language models aligned with CLIP, avoiding the need to run a text encoder every time an image is processed. A decoupled training strategy then optimizes the semantic alignment and the object localization tasks in separate streams. Experiments show this yields competitive results on zero-shot detection tasks while greatly reducing inference costs compared to multimodal baselines.

Core claim

DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams.
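One way to picture the "parallel optimization streams" is as two losses that each update their own parameter group and never exchange gradients. The sketch below is a toy under that assumption, not the paper's training code: the squared-error losses, the scalar weights `w_align` and `w_box`, and the function names are all hypothetical stand-ins, since this summary does not show the actual objectives.

```python
def alignment_loss_and_grad(w_align, feature, prototype):
    """Squared error between a scaled feature and its semantic prototype,
    with the gradient taken with respect to the alignment weight only."""
    residuals = [w_align * f - p for f, p in zip(feature, prototype)]
    loss = sum(r * r for r in residuals)
    grad = sum(2.0 * r * f for r, f in zip(residuals, feature))
    return loss, grad


def detection_loss_and_grad(w_box, feature_sum, box_target):
    """Squared error between a predicted box coordinate and its target,
    with the gradient taken with respect to the box weight only."""
    residual = w_box * feature_sum - box_target
    return residual * residual, 2.0 * residual * feature_sum


def train_decoupled(feature, prototype, box_target, steps=200, lr=0.05):
    """Run both streams side by side; each loss updates only its own weight."""
    w_align, w_box = 0.0, 0.0
    feature_sum = sum(feature)
    for _ in range(steps):
        _, g_align = alignment_loss_and_grad(w_align, feature, prototype)
        _, g_box = detection_loss_and_grad(w_box, feature_sum, box_target)
        w_align -= lr * g_align  # semantic stream (hypothetical)
        w_box -= lr * g_box      # localization stream (hypothetical)
    return w_align, w_box


# Toy inputs: a 2-d region feature, its prototype, and one box coordinate.
w_align, w_box = train_decoupled([1.0, 2.0], [0.5, 1.0], 1.5)
print(round(w_align, 4), round(w_box, 4))
```

Because the two weights never share a gradient, changing the box target leaves the learned alignment weight untouched. That independence is the property the decoupling claim leans on; whether the real model shares a backbone between streams is not stated here.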

What carries the argument

Decoupled cognition paradigm using a hierarchical semantic prototype space built offline from LVLM descriptions and CLIP alignments, which carries the semantic knowledge for vision-only inference.
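A minimal sketch of what vision-only inference against an offline prototype bank could look like, assuming each category's prototype is simply the mean of its description embeddings. The 3-d vectors are invented stand-ins for LVLM/CLIP outputs, the function names are hypothetical, and the hierarchy the paper describes is left out because the summary gives no detail on it.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_prototype_bank(description_embeddings):
    """Offline step: average each category's description embeddings
    (assumed to be CLIP-aligned already) into one unit-length prototype."""
    bank = {}
    for label, vectors in description_embeddings.items():
        dim = len(vectors[0])
        mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        bank[label] = l2_normalize(mean)
    return bank


def classify_region(region_embedding, bank):
    """Inference step: return the prototype with the highest cosine similarity.
    No text encoder is called; the bank is a plain lookup table."""
    query = l2_normalize(region_embedding)
    scores = {
        label: sum(q * p for q, p in zip(query, proto))
        for label, proto in bank.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings; real ones would come from the offline LVLM + CLIP pipeline.
toy_descriptions = {
    "zebra": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "kayak": [[0.0, 0.9, 0.2], [0.1, 1.0, 0.0]],
}
bank = build_prototype_bank(toy_descriptions)
label, score = classify_region([0.8, 0.2, 0.05], bank)
print(label, round(score, 3))
```

The bank is built once and reused for every image, which is where the claimed inference saving would come from. A category with no entry in the bank cannot be predicted at all, which is the coverage risk raised later in the review.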

If this is right

Competitive zero-shot detection performance on standard OVOD benchmarks
Significant improvement in inference efficiency by avoiding online text encoding
Effective separation of semantic cognition from detection optimization
Practical direction toward scalable open-vocabulary detection systems

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Advances in large vision-language models would likely improve the quality of the prototype space and thus the detector's performance.
This offline prototype approach might extend to video object detection or instance segmentation for similar efficiency benefits.
Evaluating the method on datasets with domain shifts could test how robust the fixed prototypes are to distribution changes.

Load-bearing premise

The region-level descriptions from pre-trained large vision-language models, when aligned with CLIP, form an accurate and reusable semantic prototype space that enables open-vocabulary generalization without any text encoding during inference.

What would settle it

Running the model on a standard zero-shot OVOD benchmark and finding that its average precision on novel categories is substantially lower than that of methods that use text encoders at inference time.
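The comparison described above comes down to computing average precision on novel categories for two detectors and looking at the gap. The sketch below uses all-point interpolated AP over ranked detections; the two detection lists are invented purely to exercise the function and are not results from the paper, and a real benchmark would also handle IoU matching and per-class averaging.

```python
def average_precision(detections, num_ground_truth):
    """All-point interpolated AP from (confidence, is_true_positive) pairs."""
    if num_ground_truth <= 0:
        raise ValueError("need at least one ground-truth box")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_pos = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_pos += 1 if is_tp else 0
        precisions.append(true_pos / rank)
        recalls.append(true_pos / num_ground_truth)
    # Precision envelope: each entry becomes the max of itself and all later entries.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the stepwise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Made-up detections on one novel class with 4 ground-truth boxes.
run_a = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
run_b = [(0.9, True), (0.8, False), (0.7, False), (0.6, True)]
ap_a = average_precision(run_a, num_ground_truth=4)
ap_b = average_precision(run_b, num_ground_truth=4)
print(ap_a, ap_b, ap_a - ap_b)
```

How large a gap counts as "substantially lower" is a judgment call the summary does not quantify.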

Figures

Figures reproduced from arXiv: 2604.02753 by Bohan Hu, Guangxin Wu, Haibo Zhan, Hao Zhang, Jifeng Shen, Junhao Dong, Linshan Li, Piotr Koniusz, Qiang Sun, Ruizhi Qian, Siheng Wang, Weiming Liu, Yanshu Li, Zhengdao Li, Zhengtao Yao.

**Figure 2.** Qualitative comparison between DeCo-DETR and the baseline. DeCo-DETR shows …

Original abstract

Open-vocabulary object detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DeCo-DETR decouples offline prototype construction from detection training to remove text encoders at inference, but the performance claims rest on unshown experiments and the prototype coverage may have gaps.


The one thing to know about this paper is that DeCo-DETR decouples the construction of a semantic prototype space from the actual detection training. It builds that space offline using region descriptions from pre-trained LVLMs aligned through CLIP, then optimizes alignment and detection in separate streams so no text encoder runs at inference. This framing is new in how it unifies the decoupling to target both efficiency and the closed-set versus open-world trade-off.

The paper does well at laying out why existing multimodal approaches struggle with deployment speed, and the offline prototype idea is a direct response to that. If the experiments back it up, it could be a practical step forward for applications that need fast open-vocabulary detection.

The soft spot is that the abstract only asserts extensive experiments without showing any numbers or controls here. The soundness depends entirely on those results, which we can't verify from the summary. There's also the risk that the prototype space leaves gaps for objects or attributes the LVLM didn't describe well, which would hurt generalization without a way to recover at runtime.

This paper is aimed at the computer vision community working on open-vocabulary and efficient detection models. A reader focused on practical improvements in inference would get value from the decoupling strategy, while someone wanting rigorous proof of broad generalization might want more evidence. I would send it for peer review. The core idea is worth referees examining the full experiments and any ablations on the prototype construction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework for open-vocabulary object detection. It constructs a hierarchical semantic prototype space offline from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, eliminating the need for online text encoding at inference. A decoupled training strategy separates semantic alignment from localization into parallel optimization streams. The central claim is that this yields competitive zero-shot detection performance on standard OVOD benchmarks while substantially improving inference efficiency compared to multimodal baselines.

Significance. If the reported results hold under scrutiny, the decoupling of semantic prototype construction from detection offers a practical route to efficient OVOD deployment by removing text-encoder overhead at test time. The offline prototype space and parallel training streams address a recognized tension between closed-set accuracy and open-world generalization, potentially enabling scalable systems for real-world applications where inference speed matters.

major comments (2)

[Abstract] The claim of 'competitive zero-shot detection performance' and 'significantly improving inference efficiency' rests on 'extensive experiments on standard OVOD benchmarks,' yet the provided text contains no quantitative results, tables, error bars, ablation details, or specific benchmark numbers; this absence is load-bearing because the efficiency and generalization advantages cannot be evaluated without them.
[Method (prototype construction)] The hierarchical semantic prototype space is asserted to supply reusable embeddings for arbitrary novel categories without online text encoding, but no coverage analysis, failure cases, or experiments on out-of-distribution concepts (rare objects, abstract attributes) are referenced; this directly undermines the zero-shot claim given the reliance on pre-trained LVLM descriptions.
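The coverage analysis asked for in the second comment could start from something as simple as the check below: for each held-out concept, take the best cosine similarity to any prototype and flag concepts that fall under a threshold. This is a sketch under stated assumptions. The `threshold=0.8` cutoff, the function names, and the 3-d toy vectors are all hypothetical, and real concept embeddings would have to come from the same CLIP space as the prototypes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


def coverage_report(concepts, prototypes, threshold=0.8):
    """Return the fraction of concepts whose nearest prototype reaches the
    threshold, plus the best similarity for each concept that falls short."""
    if not concepts or not prototypes:
        raise ValueError("need at least one concept and one prototype")
    uncovered = {}
    for name, embedding in concepts.items():
        best = max(cosine(embedding, proto) for proto in prototypes)
        if best < threshold:
            uncovered[name] = best
    covered = len(concepts) - len(uncovered)
    return covered / len(concepts), uncovered


# Toy data: two prototypes and two probe concepts, one far from both.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
concepts = {
    "zebra": [0.9, 0.1, 0.0],
    "transparency": [0.1, 0.1, 1.0],
}
fraction, uncovered = coverage_report(concepts, prototypes)
print(fraction, sorted(uncovered))
```

A report like this only measures distance in embedding space. It says nothing about whether a nearby prototype yields a correct detection, so it would complement the failure-case experiments rather than replace them.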

minor comments (2)

[Abstract] The phrase 'unified decoupling paradigm' is used without a concise definition or pointer to the relevant section, which would aid immediate comprehension.
[Experiments] The manuscript should include a clear statement of the exact OVOD benchmarks, evaluation protocol (e.g., zero-shot vs. few-shot splits), and comparison baselines in the experimental section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to the manuscript.


Referee: [Abstract] The claim of 'competitive zero-shot detection performance' and 'significantly improving inference efficiency' rests on 'extensive experiments on standard OVOD benchmarks,' yet the provided text contains no quantitative results, tables, error bars, ablation details, or specific benchmark numbers; this absence is load-bearing because the efficiency and generalization advantages cannot be evaluated without them.

Authors: We agree that the abstract, being a concise summary, does not include specific numerical results. The full manuscript presents these details in the Experiments section, including tables with mAP scores on standard OVOD benchmarks, ablation studies, error bars where applicable, and direct comparisons of inference efficiency against multimodal baselines. To address the concern, we will revise the abstract to include a brief mention of key quantitative outcomes supporting the claims.

Revision: yes

Referee: [Method (prototype construction)] The hierarchical semantic prototype space is asserted to supply reusable embeddings for arbitrary novel categories without online text encoding, but no coverage analysis, failure cases, or experiments on out-of-distribution concepts (rare objects, abstract attributes) are referenced; this directly undermines the zero-shot claim given the reliance on pre-trained LVLM descriptions.

Authors: The hierarchical semantic prototype space is built offline from region-level LVLM descriptions aligned through CLIP, with the goal of providing reusable embeddings for novel categories. Our zero-shot results on standard benchmarks demonstrate the practical effectiveness of this construction. We acknowledge that the current manuscript does not include dedicated coverage analysis, failure cases, or targeted experiments on out-of-distribution concepts such as rare objects or abstract attributes. We will add a dedicated paragraph discussing these aspects and potential limitations of the LVLM-based prototype construction.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity: derivation relies on external pre-trained models and independent benchmarks


The paper's core construction of a hierarchical semantic prototype space uses region-level descriptions from pre-trained LVLMs aligned via CLIP, followed by decoupled training streams for alignment and detection. No equations or central claims reduce by construction to fitted parameters defined inside the paper, nor do they depend on load-bearing self-citations whose validity is unverified externally. Performance is reported on standard OVOD benchmarks as an independent evaluation. This matches the default expectation of a self-contained method with external components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework rests on the unstated assumption that pre-trained LVLMs and CLIP produce faithful region semantics.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
cs.CL 2026-05 unverdicted novelty 5.0

EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.
