
arxiv: 2512.13876 · v2 · submitted 2025-12-15 · 💻 cs.CV

Dual-R-DETR: Resolving Query Competition with Pairwise Routing in Transformer Decoders

Pith reviewed 2026-05-16 21:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords object detection · DETR · transformer decoder · query routing · self-attention · set prediction · COCO · duplicate suppression

The pith

Dual-R-DETR adds suppressor and delegator routing to DETR decoder attention to reduce duplicate object queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard DETR self-attention applies the same competition to every query pair, so several queries often converge on one object while others leave regions unexplored. Dual-R-DETR labels each pair as competitive or cooperative from appearance similarity, prediction confidence, and spatial geometry, then applies targeted low-rank biases to weaken same-object links and strengthen cross-region ones. The biases are active only in a dual-branch training pass, so inference reverts to ordinary self-attention with no added cost. On COCO the approach improves multiple DETR variants, including a 1.7-point mAP gain over DINO with ResNet-50 and 57.6% mAP with Swin-L.

Core claim

Dual-R-DETR distinguishes query-to-query relations as either competitive or cooperative based on appearance similarity, prediction confidence, and spatial geometry. It introduces suppressor routing to attenuate interactions among queries targeting the same object and delegator routing to encourage diversification across distinct regions. These behaviors are realized through lightweight, learnable low-rank biases injected into decoder self-attention, enabling asymmetric query interactions while preserving the standard attention formulation and adding no inference overhead.

What carries the argument

Pairwise routing implemented as suppressor and delegator low-rank biases injected into decoder self-attention during training.
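One way to write this injection down, as a hedged sketch rather than the paper's stated equation: the names $B_s$ and $B_d$ follow the notation mentioned in the referee report's minor comments, while the rank-$r$ factors $U_s, V_s, U_d, V_d \in \mathbb{R}^{N \times r}$ (with $N$ queries and $r \ll N$) and the simple additive combination are assumptions made for illustration.

```latex
\begin{aligned}
A_{\text{train}} &= \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B_s + B_d\right),
\qquad B_s = U_s V_s^{\top},\quad B_d = U_d V_d^{\top},\\
A_{\text{infer}} &= \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right).
\end{aligned}
```

Because a product $U V^{\top}$ is not symmetric in general, entry $(i, j)$ of a bias can differ from entry $(j, i)$, which is one reading of what "asymmetric query interactions" means here.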

If this is right

  • DETR variants gain accuracy on COCO and Cityscapes without extra inference cost.
  • Fewer queries collapse onto the same object, leaving more queries free to cover distinct regions.
  • The same routing can be applied to multiple DETR variants including those with Swin backbones.
  • Training uses a dual-branch strategy so the final model is identical to vanilla DETR at test time.
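The last point can be illustrated with a toy example. The sketch below is not the authors' implementation and does not reproduce their dual-branch strategy; it only shows the property the bullets rely on: an additive bias changes the attention weights in the routed branch, and dropping the bias gives back vanilla softmax attention. The function names, the logits, and the bias values are all hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(logits, bias=None):
    """Row-wise softmax of query-to-query logits.

    bias=None is the vanilla branch (what inference would use);
    passing a bias matrix stands in for the routed training branch.
    """
    n = len(logits)
    out = []
    for i in range(n):
        row = [logits[i][j] + (bias[i][j] if bias is not None else 0.0)
               for j in range(n)]
        out.append(softmax(row))
    return out

# Toy logits for three queries; queries 0 and 1 are assumed to sit on the same object.
logits = [[2.0, 1.8, 0.1],
          [1.8, 2.0, 0.1],
          [0.1, 0.1, 2.0]]

# Hypothetical suppressor-style bias: only query 1's view of query 0 is damped,
# so the interaction is asymmetric (entry [1][0] differs from entry [0][1]).
suppress = [[0.0, 0.0, 0.0],
            [-3.0, 0.0, 0.0],
            [0.0, 0.0, 0.0]]

vanilla = attention_weights(logits)
routed = attention_weights(logits, bias=suppress)
```

In this toy case `routed[1][0]` is smaller than `vanilla[1][0]`, while row 0 is untouched, so query 1 attends less to its duplicate without query 0 being affected in return.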

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing signals could be inspected after training to reveal which object features the model treats as most competitive.
  • Because the biases are removed at inference, any accuracy gain must arise solely from altered optimization dynamics rather than architectural change.
  • The approach may extend to other set-prediction transformers where uniform attention produces redundant assignments.
  • If the routing patterns stabilize late in training, the biases might be folded into the attention weights to eliminate the dual branch entirely.

Load-bearing premise

Query relations can be labeled competitive or cooperative reliably from appearance similarity, prediction confidence, and spatial geometry, and the resulting routing will steer optimization to better coverage without instability or collapse.

What would settle it

Replace the learned routing biases with zero or random values during training and measure whether mAP falls back to the level of the unmodified baseline DETR model.
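The two control conditions in this test can be made concrete. The helper below is a hypothetical sketch, assuming the bias is a dense N x N matrix added to the attention logits and that the random control uses i.i.d. Gaussian noise; neither the name `control_bias` nor the noise distribution comes from the paper.

```python
import random

def control_bias(n, mode, scale=1.0, seed=0):
    """Build an n x n control bias for a routing ablation.

    mode="zero"   -> all zeros (routing switched off)
    mode="random" -> i.i.d. Gaussian noise with standard deviation `scale`
    """
    if mode == "zero":
        return [[0.0] * n for _ in range(n)]
    if mode == "random":
        rng = random.Random(seed)  # seeded so the control is reproducible
        return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]
    raise ValueError(f"unknown mode: {mode!r}")
```

Training once with each control and once with the learned biases, under otherwise identical settings, would separate the effect of the routing signal from the effect of merely perturbing attention.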

Original abstract

Detection Transformers (DETR) formulate object detection as a set prediction problem and enable end-to-end training without post-processing. However, object queries in DETR interact through symmetric self-attention, which enforces uniform competition among all query pairs. This often leads to inefficient query dynamics, where multiple queries converge to the same object while others fail to explore alternative regions. We propose Dual-R-DETR, a competition-aware DETR framework that explicitly regulates query interactions via pairwise routing in transformer decoders. Dual-R-DETR distinguishes query-to-query relations as either competitive or cooperative based on appearance similarity, prediction confidence, and spatial geometry. It introduces two complementary routing behaviors: suppressor routing to attenuate interactions among queries targeting the same object, and delegator routing to encourage diversification across distinct regions. These behaviors are realized through lightweight, learnable low-rank biases injected into decoder self-attention, enabling asymmetric query interactions while preserving the standard attention formulation. To ensure inference efficiency, routing biases are applied only during training using a dual-branch strategy, and inference reverts to vanilla self-attention with no additional computational cost. Extensive experiments on COCO and Cityscapes demonstrate that Dual-R-DETR consistently improves multiple DETR variants, outperforming DINO by 1.7% mAP with a ResNet-50 backbone and achieving 57.6% mAP with Swin-L under comparable settings. Code is available at https://github.com/YZk67/Dual-R-DETR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Dual-R-DETR, which modifies DETR-style object detectors by augmenting symmetric self-attention in the decoder with pairwise routing that labels query pairs as competitive or cooperative using appearance similarity, prediction confidence, and spatial geometry. It introduces suppressor routing to dampen interactions among queries converging on the same object and delegator routing to encourage queries to explore distinct regions; both are realized as lightweight learnable low-rank biases added to the attention logits only during training via a dual-branch strategy, reverting to standard attention at inference. Experiments on COCO and Cityscapes report consistent gains across DETR variants, including +1.7% mAP over DINO with ResNet-50 and 57.6% mAP with Swin-L.

Significance. If the routing mechanism is shown to be robust, the approach supplies a training-only, zero-cost-at-inference method for breaking query symmetry in set-prediction detectors. The public code release and reproducible gains on standard benchmarks constitute a concrete strength that would allow the community to verify and extend the empirical claims.

major comments (3)
  1. [§3.2] §3.2: The central claim that appearance similarity, confidence, and spatial geometry reliably distinguish competitive from cooperative query pairs is load-bearing for the routing mechanism, yet the manuscript provides no direct validation (e.g., qualitative analysis of mislabeled pairs or controlled experiments where similar-appearance queries target different objects). Without such evidence the observed mAP gains could arise from incidental regularization rather than the intended competition resolution.
  2. [§4.1] §4.1, Eq. (routing bias definition): the precise computation of the low-rank suppressor and delegator biases from the pairwise labels is not fully derived in the text; only the high-level injection into self-attention is described. This omission prevents independent verification of whether the biases remain parameter-efficient and numerically stable across training stages.
  3. [Table 2] Table 2 (ablation section): no ablation on bias rank or on the effect of early-training confidence scores is reported, even though the skeptic note correctly flags that unreliable predictions at initialization could produce systematically incorrect routing labels and thereby undermine the claimed mechanism.
minor comments (2)
  1. [§3] The dual-branch training schedule is mentioned in the abstract and §3 but never given an explicit diagram or pseudocode; a small figure would clarify how the routing branch is detached at inference.
  2. [Notation] Notation for the routing matrices (e.g., B_s and B_d) is introduced without a consolidated table of symbols, which slightly hinders readability for readers outside the immediate DETR literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the insightful and constructive feedback on our manuscript. We have addressed each major comment in detail below and will incorporate the suggested revisions to improve clarity and strengthen the empirical validation of our approach.

Point-by-point responses
  1. Referee: [§3.2] §3.2: The central claim that appearance similarity, confidence, and spatial geometry reliably distinguish competitive from cooperative query pairs is load-bearing for the routing mechanism, yet the manuscript provides no direct validation (e.g., qualitative analysis of mislabeled pairs or controlled experiments where similar-appearance queries target different objects). Without such evidence the observed mAP gains could arise from incidental regularization rather than the intended competition resolution.

    Authors: We acknowledge that direct validation of the pairwise routing labels would strengthen the central claim. The current manuscript provides indirect evidence through component ablations in Table 2, demonstrating that both suppressor and delegator routing contribute to the performance gains. However, to address this concern directly, we will add qualitative visualizations in the revised manuscript showing examples of query pairs classified as competitive (high appearance similarity, overlapping predictions) versus cooperative, along with cases where similar-appearance queries are correctly delegated to distinct objects. Additionally, we will include a controlled experiment that randomizes the routing labels to isolate the effect from incidental regularization. revision: yes

  2. Referee: [§4.1] §4.1, Eq. (routing bias definition): the precise computation of the low-rank suppressor and delegator biases from the pairwise labels is not fully derived in the text; only the high-level injection into self-attention is described. This omission prevents independent verification of whether the biases remain parameter-efficient and numerically stable across training stages.

    Authors: We agree that the exact computation of the biases requires more explicit derivation for reproducibility. In the revised Section 4.1, we will provide the full mathematical formulation: the pairwise labels are first encoded into a feature vector combining appearance similarity (cosine of query features), confidence (product of query confidences), and spatial geometry (IoU of predicted boxes). This vector is then projected to low-rank matrices U and V (rank r=8) to form the bias B = U V^T, which is added to the attention logits. We will also report the parameter overhead (approximately 0.1% of the decoder) and include a stability analysis showing that the biases remain bounded due to the low-rank constraint and normalization. revision: yes

  3. Referee: [Table 2] Table 2 (ablation section): no ablation on bias rank or on the effect of early-training confidence scores is reported, even though the skeptic note correctly flags that unreliable predictions at initialization could produce systematically incorrect routing labels and thereby undermine the claimed mechanism.

    Authors: We thank the referee for this observation. We will expand the ablation study to include variations in bias rank (r=4, 8, 16), demonstrating that performance saturates at r=8 with minimal additional parameters. Regarding early-training confidence, we will add an experiment where routing is activated only after 10 epochs (using a warm-up phase with standard attention) and compare against using initial random or ground-truth-like confidences. This will confirm that the mechanism is robust to initial prediction noise, as the dual-branch schedule allows the model to learn stable routing gradually. revision: yes
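The second response above names three pairwise signals: cosine similarity of query features, the product of query confidences, and IoU of predicted boxes. A minimal sketch of those signals follows, assuming boxes in (x1, y1, x2, y2) format and plain Python lists for features. The function names are hypothetical, and the paper's actual feature encoding and projection to the low-rank factors may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pair_features(feat_i, feat_j, conf_i, conf_j, box_i, box_j):
    """Return (appearance similarity, joint confidence, spatial overlap) for one query pair."""
    return (cosine(feat_i, feat_j), conf_i * conf_j, box_iou(box_i, box_j))
```

Under this reading, a pair with all three values high (similar features, both confident, overlapping boxes) would be a candidate for suppressor routing, and a pair with low overlap would be a candidate for delegator routing. How the paper thresholds or weights the signals is not stated in this page.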

Circularity Check

0 steps flagged

No circularity; empirical gains on external benchmarks with independent validation

Full rationale

The paper introduces a pairwise routing mechanism in DETR decoders that labels query relations from appearance similarity, confidence, and geometry, then injects low-rank biases only during training via a dual-branch setup. All reported results (e.g., +1.7% mAP over DINO on COCO with ResNet-50, 57.6% mAP with Swin-L) are measured on held-out external datasets rather than derived from or fitted inside the routing equations themselves. No self-citations, uniqueness theorems, or ansatzes are invoked to close the derivation; the central claim remains an architectural proposal whose effectiveness is tested externally and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The method rests on the standard transformer attention formulation plus the assumption that similarity-based routing labels can be computed stably from intermediate predictions. No additional free parameters beyond the model's normal weights are introduced; the low-rank biases are learned during training.
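To give a sense of scale for "lightweight", the arithmetic below compares a rank-r factorization against a dense N x N bias. It assumes the factors U and V are each of shape (N, r) and that N = 900 queries are used; both are illustrative assumptions, not figures confirmed by the paper, and the ranks 4, 8, and 16 are the values floated in the simulated rebuttal.

```python
def low_rank_params(num_queries, rank):
    """Entries in U and V when both are (num_queries x rank) matrices."""
    return 2 * num_queries * rank

def full_params(num_queries):
    """Entries in a dense (num_queries x num_queries) bias matrix."""
    return num_queries * num_queries

# Assumed setting: 900 queries, ranks from the simulated rebuttal.
for r in (4, 8, 16):
    print(r, low_rank_params(900, r), full_params(900))
# prints:
# 4 7200 810000
# 8 14400 810000
# 16 28800 810000
```

Under these assumptions a rank-8 bias has under 2% of the entries of a dense bias, which is consistent with the page's description of the biases as lightweight but does not verify the paper's own overhead figure.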

axioms (1)
  • [standard math] Standard scaled dot-product self-attention remains the core operation.
    The paper states that routing biases are injected into the existing attention formulation without changing its mathematical structure.
invented entities (2)
  • Suppressor routing (no independent evidence)
    purpose: Attenuate attention among queries targeting the same object
    New behavior defined in the paper to reduce duplicate detections.
  • Delegator routing (no independent evidence)
    purpose: Encourage queries to explore distinct spatial regions
    New behavior defined in the paper to improve coverage.


