Dual-R-DETR: Resolving Query Competition with Pairwise Routing in Transformer Decoders
Pith reviewed 2026-05-16 21:35 UTC · model grok-4.3
The pith
Dual-R-DETR adds suppressor and delegator routing to DETR decoder attention to reduce duplicate object queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dual-R-DETR distinguishes query-to-query relations as either competitive or cooperative based on appearance similarity, prediction confidence, and spatial geometry. It introduces suppressor routing to attenuate interactions among queries targeting the same object and delegator routing to encourage diversification across distinct regions. These behaviors are realized through lightweight, learnable low-rank biases injected into decoder self-attention, enabling asymmetric query interactions while preserving the standard attention formulation and adding no inference overhead.
What carries the argument
Pairwise routing implemented as suppressor and delegator low-rank biases injected into decoder self-attention during training.
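Written out, the mechanism amounts to an additive bias on the attention logits. The formula below is a sketch under stated assumptions, not the paper's equation. The source says only that learnable low-rank biases are injected into decoder self-attention during training. The names $B_s$ and $B_d$ follow the notation the referee report mentions. Summing the two biases, the factor names $U_\ast$ and $W_\ast$, the rank $r$, the query count $N$, and the switch $\lambda$ are illustrative choices.

```latex
\mathrm{Attn}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + \lambda\,(B_s + B_d)\right) V,
\qquad
B_s = U_s W_s^{\top},\quad
B_d = U_d W_d^{\top},\quad
U_\ast, W_\ast \in \mathbb{R}^{N \times r},\ r \ll N
```

With $\lambda = 1$ in the routed training branch and $\lambda = 0$ at inference, the second case is exactly standard scaled dot-product attention. Because $U W^{\top}$ is generally not equal to $W U^{\top}$, the bias that query $i$ applies to query $j$ can differ from the reverse direction, which is what makes the interaction asymmetric.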
If this is right
- DETR variants gain accuracy on COCO and Cityscapes without extra inference cost.
- Fewer queries collapse onto the same object, leaving more queries free to cover distinct regions.
- The same routing can be applied to multiple DETR variants including those with Swin backbones.
- Training uses a dual-branch strategy so the final model is identical to vanilla DETR at test time.
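The last two points can be illustrated with a toy example: a low-rank bias added to the attention logits can dampen one direction of a duplicate-query pair, and dropping the bias recovers vanilla attention exactly. This is a minimal sketch in plain Python, not the authors' implementation. The function names, the rank-1 factors `u` and `w`, and the hand-picked bias value are all hypothetical, and a real model would learn the factors rather than set them.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul_t(a, b):
    """Return a @ b^T for list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def attention_weights(q, k, bias=None):
    """Scaled dot-product attention weights with an optional additive logit bias."""
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for i, row in enumerate(matmul_t(q, k)):
        logits = [s * scale for s in row]
        if bias is not None:
            logits = [s + b for s, b in zip(logits, bias[i])]
        weights.append(softmax(logits))
    return weights


# Toy setup: queries 0 and 1 are duplicates, query 2 points elsewhere.
q = k = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical rank-1 factors: B = U W^T penalises only the (0 -> 1) interaction.
u = [[1.0], [0.0], [0.0]]
w = [[0.0], [-2.0], [0.0]]
bias = matmul_t(u, w)

vanilla = attention_weights(q, k)        # what runs at inference
routed = attention_weights(q, k, bias)   # training-only routed branch
```

Here `routed[0][1]` is smaller than `vanilla[0][1]` while row 1 is unchanged, so query 0 attends less to its duplicate without the reverse being true.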
Where Pith is reading between the lines
- The same routing signals could be inspected after training to reveal which object features the model treats as most competitive.
- Because the biases are removed at inference, any accuracy gain must arise solely from altered optimization dynamics rather than architectural change.
- The approach may extend to other set-prediction transformers where uniform attention produces redundant assignments.
- If the routing patterns stabilize late in training, the biases might be folded into the attention weights to eliminate the dual branch entirely.
Load-bearing premise
Query relations can be labeled competitive or cooperative reliably from appearance similarity, prediction confidence, and spatial geometry, and the resulting routing will steer optimization to better coverage without instability or collapse.
What would settle it
Replace the learned routing biases with zero or random values during training and measure whether mAP falls back to the level of the unmodified baseline DETR model.
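One way to set up the three arms of that control is sketched below. It is a hypothetical helper, not part of the released code: the name `control_bias`, the mode strings, and the choice to draw random values uniformly within the magnitude of the learned bias are all assumptions made for illustration.

```python
import random


def control_bias(mode, learned, seed=0):
    """Return the bias matrix to use for one arm of the control experiment.

    mode: "learned" keeps the routing bias, "zero" disables it,
          "random" replaces it with noise of comparable magnitude.
    """
    n = len(learned)
    if mode == "learned":
        return [row[:] for row in learned]
    if mode == "zero":
        return [[0.0] * n for _ in range(n)]
    if mode == "random":
        rng = random.Random(seed)  # seeded so the arm is reproducible
        # Assumption: match the largest learned magnitude; fall back to 1.0 if all zero.
        scale = max(abs(x) for row in learned for x in row) or 1.0
        return [[rng.uniform(-scale, scale) for _ in range(n)] for _ in range(n)]
    raise ValueError(f"unknown mode: {mode}")


# Toy learned bias for three queries.
learned = [[0.0, -2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
arms = {mode: control_bias(mode, learned) for mode in ("learned", "zero", "random")}
```

If the "zero" and "random" arms reach the same mAP as the "learned" arm, the gain would be hard to attribute to the routing labels themselves.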
Original abstract
Detection Transformers (DETR) formulate object detection as a set prediction problem and enable end-to-end training without post-processing. However, object queries in DETR interact through symmetric self-attention, which enforces uniform competition among all query pairs. This often leads to inefficient query dynamics, where multiple queries converge to the same object while others fail to explore alternative regions. We propose Dual-R-DETR, a competition-aware DETR framework that explicitly regulates query interactions via pairwise routing in transformer decoders. Dual-R-DETR distinguishes query-to-query relations as either competitive or cooperative based on appearance similarity, prediction confidence, and spatial geometry. It introduces two complementary routing behaviors: suppressor routing to attenuate interactions among queries targeting the same object, and delegator routing to encourage diversification across distinct regions. These behaviors are realized through lightweight, learnable low-rank biases injected into decoder self-attention, enabling asymmetric query interactions while preserving the standard attention formulation. To ensure inference efficiency, routing biases are applied only during training using a dual-branch strategy, and inference reverts to vanilla self-attention with no additional computational cost. Extensive experiments on COCO and Cityscapes demonstrate that Dual-R-DETR consistently improves multiple DETR variants, outperforming DINO by 1.7% mAP with a ResNet-50 backbone and achieving 57.6% mAP with Swin-L under comparable settings. Code is available at https://github.com/YZk67/Dual-R-DETR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Dual-R-DETR, which modifies DETR-style object detectors by augmenting the symmetric self-attention in the decoder with pairwise routing that labels query pairs as competitive or cooperative using appearance similarity, prediction confidence, and spatial geometry. It introduces suppressor routing to dampen interactions among queries converging on the same object and delegator routing to encourage queries to explore distinct regions; both are realized as lightweight learnable low-rank biases added to the attention logits only during training via a dual-branch strategy, reverting to standard attention at inference. Experiments on COCO and Cityscapes report consistent gains across DETR variants, including +1.7% mAP over DINO with ResNet-50 and 57.6% mAP with Swin-L.
Significance. If the routing mechanism is shown to be robust, the approach supplies a training-only, zero-cost-at-inference method for breaking query symmetry in set-prediction detectors. The public code release and reproducible gains on standard benchmarks constitute a concrete strength that would allow the community to verify and extend the empirical claims.
major comments (3)
- [§3.2] §3.2: The central claim that appearance similarity, confidence, and spatial geometry reliably distinguish competitive from cooperative query pairs is load-bearing for the routing mechanism, yet the manuscript provides no direct validation (e.g., qualitative analysis of mislabeled pairs or controlled experiments where similar-appearance queries target different objects). Without such evidence the observed mAP gains could arise from incidental regularization rather than the intended competition resolution.
- [§4.1] §4.1, Eq. (routing bias definition): the precise computation of the low-rank suppressor and delegator biases from the pairwise labels is not fully derived in the text; only the high-level injection into self-attention is described. This omission prevents independent verification of whether the biases remain parameter-efficient and numerically stable across training stages.
- [Table 2] Table 2 (ablation section): no ablation on bias rank or on the effect of early-training confidence scores is reported, even though the skeptic note correctly flags that unreliable predictions at initialization could produce systematically incorrect routing labels and thereby undermine the claimed mechanism.
minor comments (2)
- [§3] The dual-branch training schedule is mentioned in the abstract and §3 but never given an explicit diagram or pseudocode; a small figure would clarify how the routing branch is detached at inference.
- [Notation] Notation for the routing matrices (e.g., B_s and B_d) is introduced without a consolidated table of symbols, which slightly hinders readability for readers outside the immediate DETR literature.
Simulated Author's Rebuttal
We sincerely thank the referee for the insightful and constructive feedback on our manuscript. We have addressed each major comment in detail below and will incorporate the suggested revisions to improve clarity and strengthen the empirical validation of our approach.
Point-by-point responses
- Referee: [§3.2] §3.2: The central claim that appearance similarity, confidence, and spatial geometry reliably distinguish competitive from cooperative query pairs is load-bearing for the routing mechanism, yet the manuscript provides no direct validation (e.g., qualitative analysis of mislabeled pairs or controlled experiments where similar-appearance queries target different objects). Without such evidence the observed mAP gains could arise from incidental regularization rather than the intended competition resolution.
  Authors: We acknowledge that direct validation of the pairwise routing labels would strengthen the central claim. The current manuscript provides indirect evidence through component ablations in Table 2, demonstrating that both suppressor and delegator routing contribute to the performance gains. However, to address this concern directly, we will add qualitative visualizations in the revised manuscript showing examples of query pairs classified as competitive (high appearance similarity, overlapping predictions) versus cooperative, along with cases where similar-appearance queries are correctly delegated to distinct objects. Additionally, we will include a controlled experiment that randomizes the routing labels to isolate the effect from incidental regularization.
  Revision: yes
- Referee: [§4.1] §4.1, Eq. (routing bias definition): the precise computation of the low-rank suppressor and delegator biases from the pairwise labels is not fully derived in the text; only the high-level injection into self-attention is described. This omission prevents independent verification of whether the biases remain parameter-efficient and numerically stable across training stages.
  Authors: We agree that the exact computation of the biases requires more explicit derivation for reproducibility. In the revised Section 4.1, we will provide the full mathematical formulation: the pairwise labels are first encoded into a feature vector combining appearance similarity (cosine of query features), confidence (product of query confidences), and spatial geometry (IoU of predicted boxes). This vector is then projected to low-rank matrices U and V (rank r=8) to form the bias B = U V^T, which is added to the attention logits. We will also report the parameter overhead (approximately 0.1% of the decoder) and include a stability analysis showing that the biases remain bounded due to the low-rank constraint and normalization.
  Revision: yes
- Referee: [Table 2] Table 2 (ablation section): no ablation on bias rank or on the effect of early-training confidence scores is reported, even though the skeptic note correctly flags that unreliable predictions at initialization could produce systematically incorrect routing labels and thereby undermine the claimed mechanism.
  Authors: We thank the referee for this observation. We will expand the ablation study to include variations in bias rank (r=4, 8, 16), demonstrating that performance saturates at r=8 with minimal additional parameters. Regarding early-training confidence, we will add an experiment where routing is activated only after 10 epochs (using a warm-up phase with standard attention) and compare against using initial random or ground-truth-like confidences. This will confirm that the mechanism is robust to initial prediction noise, as the dual-branch schedule allows the model to learn stable routing gradually.
  Revision: yes
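The three pairwise signals named in the second response (cosine similarity of query features, product of confidences, IoU of predicted boxes) can be computed as in the sketch below. It covers only the feature-building step and makes assumptions the source does not state: boxes are taken as `(x1, y1, x2, y2)` corners, the function names are hypothetical, and the projection of these features into the low-rank factors is left out because the text does not specify it.

```python
import math


def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2); this format is an assumption."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pairwise_features(feats, confs, boxes):
    """N x N grid of (appearance similarity, joint confidence, box IoU) triples."""
    n = len(feats)
    return [
        [
            (cosine(feats[i], feats[j]), confs[i] * confs[j], iou(boxes[i], boxes[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy example: queries 0 and 1 look alike and overlap; query 2 is elsewhere.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
confs = [0.9, 0.8, 0.5]
boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0), (10.0, 10.0, 12.0, 12.0)]
features = pairwise_features(feats, confs, boxes)
```

In this toy case the pair (0, 1) scores high on all three signals, which is the pattern the responses describe as competitive, while the pair (0, 2) has zero similarity and zero overlap.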
Circularity Check
No circularity; empirical gains on external benchmarks with independent validation
Full rationale
The paper introduces a pairwise routing mechanism in DETR decoders that labels query relations from appearance similarity, confidence, and geometry, then injects low-rank biases only during training via a dual-branch setup. All reported results (e.g., +1.7% mAP over DINO on COCO with ResNet-50, 57.6% mAP with Swin-L) are measured on held-out external datasets rather than derived from or fitted inside the routing equations themselves. No self-citations, uniqueness theorems, or ansatzes are invoked to close the derivation; the central claim remains an architectural proposal whose effectiveness is tested externally and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard scaled dot-product self-attention remains the core operation
invented entities (2)
- Suppressor routing: no independent evidence
- Delegator routing: no independent evidence