MDS-DETR: DETR with Masked Duplicate Suppressor

Chanho Lee; Junmo Kim; Seunghee Koh; Yunho Jeon

arxiv: 2605.23507 · v1 · pith:P473WC25new · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 04:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords: DETR, object detection, one-to-many matching, duplicate suppression, self-attention, MS COCO, transformer detector, end-to-end detection

The pith

MDS-DETR combines one-to-one and one-to-many supervision in a single decoder by using confidence-based causal masking to suppress duplicates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that DETR's slow convergence and low recall from strict one-to-one matching can be fixed by adding one-to-many label assignment inside the same decoder. It does this through a Masked Duplicate Suppressor that applies confidence-based causal masking to self-attention, removing only the extra duplicate predictions created by the one-to-many signals. A sympathetic reader would care because the method avoids the extra decoders or queries used in prior work, delivering +2.8 mAP over Deformable-DETR on MS COCO with a 12-epoch ResNet-50 schedule and only a 5% training-time increase, while also beating MR.DETR by 0.3 mAP and training 20% faster.

Core claim

MDS-DETR integrates one-to-many supervision directly into the main decoder by injecting asymmetry through confidence-based causal masking in self-attention; this filters duplicates generated by the one-to-many layer, preserves one-to-one matching benefits at inference, and produces duplicate-free predictions without auxiliary components or extra queries.

What carries the argument

The Masked Duplicate Suppressor (MDS), which applies confidence-based causal masking to self-attention to suppress duplicates from the one-to-many supervised layer.
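
A minimal sketch of one way such a mask could be built. This page does not spell out the paper's formulation, so the rule used here is an assumption: each query may attend to itself and to queries with higher predicted confidence, and to nothing else. The names `confidence_causal_mask` and `masked_softmax` are hypothetical and are not taken from the released code.

```python
import math

def confidence_causal_mask(scores):
    """Return mask[i][j] = True when query i may attend to query j.

    Assumption (not confirmed by the source): query i sees itself and
    every query with higher confidence. Ties are broken by index so
    the relation stays asymmetric.
    """
    n = len(scores)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            higher = scores[j] > scores[i] or (scores[j] == scores[i] and j < i)
            mask[i][j] = (i == j) or higher
    return mask

def masked_softmax(logits, mask_row):
    """Softmax over one attention row, giving zero weight to masked positions."""
    kept = [l for l, m in zip(logits, mask_row) if m]
    peak = max(kept)  # never empty: the diagonal is always allowed
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Three queries with made-up confidences and uniform attention logits.
scores = [0.9, 0.4, 0.7]
mask = confidence_causal_mask(scores)
weights = [masked_softmax([0.0, 0.0, 0.0], row) for row in mask]
```

Under this assumed rule the mask becomes lower-triangular (diagonal included) once queries are sorted by descending confidence, which is one sense in which it could be called causal: the most confident query attends only to itself, while a weaker query can see its stronger competitors and so has the information needed to recognise itself as a duplicate.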

If this is right

Achieves +2.8 mAP over Deformable-DETR on MS COCO with ResNet-50 under a 12-epoch schedule and only 5% added training time.
Outperforms MR.DETR by +0.3 mAP while training 20% faster.
Requires no additional queries or auxiliary decoders.
Produces explainable, duplicate-free predictions inside a fully end-to-end framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The asymmetry-injection idea could be tested on other transformer detectors that suffer from duplicate predictions.
Similar masking might be applied at different layers or to cross-attention to further reduce training overhead.
The approach may scale more easily to larger backbones because it avoids the cost of extra decoders.

Load-bearing premise

Confidence-based causal masking reliably removes only duplicates from one-to-many supervision without discarding valid detections or lowering overall recall.

What would settle it

An evaluation on a crowded-scene subset of MS COCO or similar data where recall falls below the Deformable-DETR baseline after applying the masking.
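
The check described above can be sketched as a recall comparison at a fixed IoU threshold. Everything here is illustrative: the boxes are made-up `(x1, y1, x2, y2)` tuples rather than COCO data, and `recall_at_iou` is a hypothetical helper, not the COCO evaluation protocol, which additionally handles classes, score ordering, and crowd annotations.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(gt_boxes, pred_boxes, thresh=0.5):
    """Fraction of ground-truth boxes matched by a distinct prediction."""
    used = set()
    hits = 0
    for g in gt_boxes:
        best, best_iou = None, thresh
        for k, p in enumerate(pred_boxes):
            if k in used:
                continue  # each prediction may match one ground truth only
            v = iou(g, p)
            if v >= best_iou:
                best, best_iou = k, v
        if best is not None:
            used.add(best)
            hits += 1
    return hits / len(gt_boxes) if gt_boxes else 0.0

# Two heavily overlapping objects plus one isolated object (made-up scene).
gt = [(0, 0, 10, 10), (8, 0, 18, 10), (30, 30, 40, 40)]
baseline_preds = [(0, 0, 10, 10), (8, 0, 18, 10)]
masked_preds = [(0, 0, 10, 10)]  # hypothetical: the overlapping neighbour was suppressed
recall_baseline = recall_at_iou(gt, baseline_preds)
recall_masked = recall_at_iou(gt, masked_preds)
```

In this toy scene the masked output has lower recall than the baseline, which is the failure pattern the test is meant to look for. Whether MDS-DETR actually behaves this way on crowded images is exactly what the evaluation would have to establish.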

Figures

Figures reproduced from arXiv: 2605.23507 by Chanho Lee, Junmo Kim, Seunghee Koh, Yunho Jeon.

**Figure 2.** Visualization of masked self-attention heatmaps.

**Figure 4.** Qualitative results on COCO.

Original abstract

The DEtection TRansformer (DETR) is a powerful end-to-end object detector, yet its one-to-one matching strategy suffers from slow convergence and low recall. A common approach to address this issue is to use one-to-many label assignment to provide more positive samples. However, existing methods that use one-to-many matching as an auxiliary objective lead to increased training costs, with their auxiliary decoders discarded during inference. To address this limitation, we propose MDS-DETR, which leverages both one-to-one and one-to-many supervision within a single decoder. Specifically, we introduce a Masked Duplicate Suppressor (MDS) that injects asymmetry into self-attention via confidence-based causal masking. MDS filters out the duplicates generated by the one-to-many supervised layer, enables explainable, duplicate-free predictions in a fully end-to-end framework. MDS-DETR outperforms existing one-to-many DETR variants such as MS-DETR, MR.DETR and Relation-DETR, without relying on any additional queries or auxiliary decoders. Under a 12-epoch training schedule on MS COCO with a ResNet-50 backbone, MDS-DETR achieves a +2.8 mAP improvement over Deformable-DETR with only a 5% increase in training time, and outperforms the state-of-the-art MR.DETR by +0.3 mAP while being even 20% faster in training. Our code and models are available at https://github.com/DChoLee/MDS-DETR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MDS-DETR puts one-to-many supervision into the main decoder via confidence-based causal masking in self-attention, delivering modest COCO gains without extra decoders.

MDS-DETR keeps one-to-many supervision inside the main decoder by adding a Masked Duplicate Suppressor that applies confidence-based causal masking to self-attention. This is the main new piece compared to prior one-to-many DETR work like MR.DETR or MS-DETR. The paper does a decent job laying out how the masking creates the needed asymmetry so that the one-to-many layer can provide more positives during training without leaving duplicates at inference. They report a +2.8 mAP lift over Deformable-DETR on COCO with ResNet-50 under a 12-epoch schedule, at only 5% extra training time, and a small edge over MR.DETR while being faster. Releasing the code is helpful for checking the details.

The potential issue is whether that masking step really targets only duplicates. If the confidence threshold ends up dropping some correct low-confidence detections, the recall could drop and the mAP gain might depend on careful tuning rather than the method itself. The abstract does not include direct measurements of the pre- and post-masking sets or recall changes, so it is hard to judge how cleanly the suppression works. The soundness looks okay for an empirical paper but would benefit from more error analysis.

This paper is aimed at researchers tuning DETR-style detectors for better convergence without extra modules. A reader interested in practical improvements to end-to-end object detection would get value from the numbers and the released implementation. It deserves a serious referee because the results are on a standard benchmark and the approach is simple enough to test. I would send it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper proposes MDS-DETR, a single-decoder DETR variant that combines one-to-one and one-to-many supervision by introducing a Masked Duplicate Suppressor (MDS) module. MDS injects asymmetry into self-attention via confidence-based causal masking to suppress duplicates generated by the one-to-many layer at inference time, enabling end-to-end duplicate-free predictions without auxiliary decoders or extra queries. On MS COCO with ResNet-50 under a 12-epoch schedule, it reports +2.8 mAP over Deformable-DETR (with 5% training time increase) and +0.3 mAP over MR.DETR (while being 20% faster).

Significance. If the central attribution holds, the result would be significant for the DETR literature: it offers a simpler, single-decoder route to the benefits of one-to-many assignment while preserving end-to-end inference, with concrete efficiency gains on a standard benchmark. The availability of code and models strengthens reproducibility.

major comments (3)

[MDS description] MDS description (method section): the claim that confidence-based causal masking 'filters out the duplicates generated by the one-to-many supervised layer' without discarding valid detections rests on an untested asymmetry assumption; no pre-/post-masking detection-set comparison, recall delta, or low-confidence correct-query analysis is provided to confirm the mask acts selectively.
[Experiments] Experiments (results section): the +2.8 mAP and +0.3 mAP claims are presented as direct outcomes of MDS, yet the manuscript provides no ablation that isolates the masking threshold or causal mask from other training choices, leaving open whether gains are robustly attributable to the proposed component rather than hyperparameter variation.
[COCO results table/figure] Table/figure on COCO results: without reported recall or duplicate-count metrics before versus after MDS, it is impossible to verify that the observed mAP improvement does not trade off recall for precision, which is load-bearing for the 'duplicate-free yet high-recall' central claim.
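
One way the duplicate-count metric requested above might be defined, as a sketch under stated assumptions: a prediction counts as a duplicate when a higher-scoring kept prediction of the same class overlaps it at or above an IoU threshold. This is the greedy rule used by non-maximum suppression, applied here only for counting. The 0.7 threshold, the `(score, label, box)` layout, and the name `count_duplicates` are illustrative choices, not values from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_duplicates(preds, thresh=0.7):
    """Count predictions shadowed by a higher-scoring same-class prediction.

    preds: list of (score, label, box) tuples.
    """
    ordered = sorted(preds, key=lambda p: p[0], reverse=True)
    kept = []
    duplicates = 0
    for score, label, box in ordered:
        if any(l == label and iou(box, b) >= thresh for _, l, b in kept):
            duplicates += 1
        else:
            kept.append((score, label, box))
    return duplicates

# Made-up predictions: the second "cat" box nearly coincides with the first.
preds = [
    (0.9, "cat", (0, 0, 10, 10)),
    (0.6, "cat", (1, 0, 10, 10)),
    (0.5, "dog", (0, 0, 10, 10)),
    (0.3, "cat", (20, 20, 30, 30)),
]
n_dup = count_duplicates(preds)
```

Reporting this count alongside recall for the decoder output with and without the mask would show whether the suppression removes duplicates without also removing distinct detections.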

minor comments (2)

[Method] Notation for the causal mask and confidence threshold is introduced without an explicit equation or pseudocode, making the exact implementation of the asymmetry injection difficult to reproduce from the text alone.
[Abstract] The abstract states 'our code and models are available' but the manuscript does not include a direct link or commit hash in the main body, which is a minor reproducibility detail.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. The feedback identifies specific areas where additional empirical support would strengthen the central claims regarding the Masked Duplicate Suppressor. We address each point below and will incorporate the requested analyses and ablations into the revised manuscript.

Referee: [MDS description] MDS description (method section): the claim that confidence-based causal masking 'filters out the duplicates generated by the one-to-many supervised layer' without discarding valid detections rests on an untested asymmetry assumption; no pre-/post-masking detection-set comparison, recall delta, or low-confidence correct-query analysis is provided to confirm the mask acts selectively.

Authors: We agree that direct empirical verification of the masking mechanism's selectivity would strengthen the method section. In the revision we will add pre- and post-masking detection-set comparisons on the validation set, report recall deltas, and include an analysis of low-confidence correct queries to demonstrate that the confidence-based causal mask primarily suppresses duplicates while preserving valid detections.

revision: yes

Referee: [Experiments] Experiments (results section): the +2.8 mAP and +0.3 mAP claims are presented as direct outcomes of MDS, yet the manuscript provides no ablation that isolates the masking threshold or causal mask from other training choices, leaving open whether gains are robustly attributable to the proposed component rather than hyperparameter variation.

Authors: We acknowledge the value of isolating the contribution of the masking threshold and causal mask. The revised manuscript will include a dedicated ablation table that varies the masking threshold while holding all other training hyperparameters fixed, and compares performance with and without the causal masking component, to confirm that the reported gains are attributable to MDS rather than incidental hyperparameter choices.

revision: yes

Referee: [COCO results table/figure] Table/figure on COCO results: without reported recall or duplicate-count metrics before versus after MDS, it is impossible to verify that the observed mAP improvement does not trade off recall for precision, which is load-bearing for the 'duplicate-free yet high-recall' central claim.

Authors: We recognize that recall and duplicate-count metrics are necessary to substantiate the central claim. In the revision we will augment the COCO results table and any associated figure with recall values and duplicate-count statistics computed before and after MDS application, thereby allowing direct verification that mAP gains arise without sacrificing recall.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims are empirical measurements on public benchmark

The paper introduces an architectural change (single-decoder one-to-one plus one-to-many supervision with confidence-based causal masking) and reports measured mAP and training-time numbers on MS COCO. These numbers are direct experimental outcomes rather than quantities derived from equations, fitted parameters renamed as predictions, or self-referential definitions. No mathematical derivation chain, self-definitional steps, or load-bearing self-citations appear in the provided text; the central claim therefore does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard transformer training assumptions plus the new architectural component; no explicit free parameters beyond typical model hyperparameters are named in the abstract.

axioms (1)

Domain assumption: One-to-many label assignment improves recall but generates duplicates that must be suppressed post hoc.
Invoked in the abstract as the motivation for the MDS component.
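
The assumption can be illustrated with a toy comparison of the two assignment schemes. This is a sketch, not the paper's matcher: the brute-force search stands in for Hungarian matching and is only viable for tiny inputs, the top-k rule is just one common form of one-to-many assignment, and the cost values and the function names `one_to_one` and `one_to_many` are made up for illustration.

```python
from itertools import permutations

def one_to_one(cost):
    """Minimum-cost matching with each query used at most once.

    cost[q][g] is the cost of assigning query q to ground truth g.
    Brute-force stand-in for Hungarian matching; needs len(cost) >= len(cost[0]).
    Returns {ground_truth_index: query_index}.
    """
    n_q, n_g = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for qs in permutations(range(n_q), n_g):
        total = sum(cost[q][g] for g, q in enumerate(qs))
        if total < best_total:
            best, best_total = qs, total
    return {g: q for g, q in enumerate(best)}

def one_to_many(cost, k=2):
    """Assign the k lowest-cost queries to every ground truth (top-k rule)."""
    n_q, n_g = len(cost), len(cost[0])
    return {g: sorted(range(n_q), key=lambda q: cost[q][g])[:k] for g in range(n_g)}

# Four queries, two ground-truth objects, made-up matching costs.
cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.3],
    [0.7, 0.6],
]
o2o = one_to_one(cost)
o2m = one_to_many(cost, k=2)
positives_o2o = len(o2o)
positives_o2m = sum(len(qs) for qs in o2m.values())
```

With these numbers the one-to-one matcher yields two positive queries and the top-2 rule yields four. The extra positives give more training signal, but every ground truth now has two queries trained to predict it, which is where the duplicates come from.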

invented entities (1)

Masked Duplicate Suppressor (MDS): no independent evidence
purpose: Inject asymmetry into self-attention via confidence-based causal masking to filter duplicates from one-to-many supervision
New component introduced to enable single-decoder training


[1] [1]

End-to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I. Berlin, Heidelberg: Springer-Verlag, 2020, p. 213–229. [Online]. Available: https://doi.org/ 10.1007/978-3-030-58452-8_13

work page doi:10.1007/978-3-030-58452-8_13 2020

[2] [2]

Detrs with hybrid matching,

D. Jia, Y . Yuan, H. He, X. Wu, H. Yu, W. Lin, L. Sun, C. Zhang, and H. Hu, “Detrs with hybrid matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 19 702–19 712

work page 2023

[3] [3]

Group detr: Fast detr training with group-wise VOLUME 4, 2016 9 Authoret al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS one-to-many assignment,

Q. Chen, X. Chen, J. Wang, S. Zhang, K. Yao, H. Feng, J. Han, E. Ding, G. Zeng, and J. Wang, “Group detr: Fast detr training with group-wise VOLUME 4, 2016 9 Authoret al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS one-to-many assignment,” in Proceedings of the IEEE International Con- ference on Computer Vision (ICCV), 2023

work page 2016

[4] [4]

Detrs with collaborative hybrid assign- ments training,

Z. Zong, G. Song, and Y . Liu, “Detrs with collaborative hybrid assign- ments training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 6748–6758

work page 2023

[5] [5]

Rank-DETR for high quality object detection,

Y . Pu, W. Liang, Y . Hao, Y . Yuan, Y . Yang, C. Zhang, H. Hu, and G. Huang, “Rank-DETR for high quality object detection,” in Thirty-seventh Conference on Neural Information Processing Systems,

work page

[6] [6]

Available: https://openreview.net/forum?id=WUott1ZvRj

[Online]. Available: https://openreview.net/forum?id=WUott1ZvRj

work page

[7] [7]

Relation detr: Exploring explicit position relation prior for object detection,

X. Hou, M. Liu, S. Zhang, P. Wei, B. Chen, and X. Lan, “Relation detr: Exploring explicit position relation prior for object detection,” in European conference on computer vision. Springer, 2024

work page 2024

[8] [8]

Mr. detr: Instructive multi-route training for detection transformers,

C.-B. Zhang, Y . Zhong, and K. Han, “Mr. detr: Instructive multi-route training for detection transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[9] [9]

Deformable-detr: Deformable transformers for end-to-end object detection,

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable-detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=gZ9hCDWe6ke

work page 2021

[10] [10]

DINO: DETR with improved denoising anchor boxes for end-to-end object detection,

H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y . Shum, “DINO: DETR with improved denoising anchor boxes for end-to-end object detection,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=3mRwyG5one

work page 2023

[11] [11]

Ease-detr: Easing the competition among object queries,

Y . Gao, Y . Sun, X. Ding, C. Zhao, and S. Liu, “Ease-detr: Easing the competition among object queries,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024, pp. 17 282– 17 291

work page 2024

[12] [12]

Microsoft COCO: common objects in context,

T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, ser. Lecture Notes in Computer Science, D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars...

work page doi:10.1007/978-3-319-10602-1_48 2014

[13] [13]

Ms- detr: Efficient detr training with mixed supervision,

C. Zhao, Y . Sun, W. Wang, Q. Chen, E. Ding, Y . Yang, and J. Wang, “Ms- detr: Efficient detr training with mixed supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17 027–17 036

work page 2024

[14] [14]

An end-to-end transformer model for 3d object detection,

I. Misra, R. Girdhar, and A. Joulin, “An end-to-end transformer model for 3d object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 2906–2917

work page 2021

[15] [15]

Masked- attention mask transformer for universal image segmentation,

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked- attention mask transformer for universal image segmentation,” in Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299

work page 2022

[16] [16]

Track- former: Multi-object tracking with transformers,

T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Track- former: Multi-object tracking with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8844–8854

work page 2022

[17] Q. Yu, H. Wang, D. Kim, S. Qiao, M. Collins, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “Cmt-deeplab: Clustering mask transformers for panoptic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 2560–2570

[18] K. Li, S. Wang, X. Zhang, Y. Xu, W. Xu, and Z. Tu, “Pose recognition with cascade transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1944–1953

[19] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “DAB-DETR: Dynamic anchor boxes are better queries for DETR,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=oMI9PjOb9Jl

[20] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 619–13 627

[21] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor detr: Query design for transformer-based detector,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 2567–2575, Jun. 2022. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/20158

[22] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, “Fast convergence of detr with spatially modulated co-attention,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3621–3630

[23] X. Hou, M. Liu, S. Zhang, P. Wei, and B. Chen, “Salience detr: Enhancing detection transformer with hierarchical salience filtering refinement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 17 574–17 583

[24] Z. Hu, Y. Sun, J. Wang, and Y. Yang, “Dac-detr: Divide the attention layers and conquer,” in NeurIPS, 2023. [Online]. Available: http://papers.nips.cc/paper_files/paper/2023/hash/edd0d433f8a1a51aa11237a6543fc280-Abstract-Conference.html

[25] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” Advances in neural information processing systems, vol. 33, pp. 21 002–21 012, 2020

[26] H. Zhang, Y. Wang, F. Dayoub, and N. Sunderhauf, “Varifocalnet: An iou-aware dense object detector,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8514–8523

[27] S. Wu, X. Li, and X. Wang, “Iou-aware single-stage object detector for accurate localization,” Image and Vision Computing, vol. 97, p. 103911, 2020

[28] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “Tood: Task-aligned one-stage object detection,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2021, pp. 3490–3499

[29] Z. Cai, S. Liu, G. Wang, Z. Ge, X. Zhang, and D. Huang, “Align-detr: Improving detr with simple iou-aware bce loss,” 2023

[30] J. Ouyang-Zhang, J. H. Cho, X. Zhou, and P. Krähenbühl, “Nms strikes back,” arXiv preprint arXiv:2212.06137, 2022

[31] S. Zhang, X. Wang, J. Wang, J. Pang, C. Lyu, W. Zhang, P. Luo, and K. Chen, “Dense distinct query for end-to-end object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 7329–7338

[32] W. Wang, J. Zhang, Y. Cao, Y. Shen, and D. Tao, “Towards data-efficient detection transformers,” in Proc. Eur. Conf. Computer Vision (ECCV), 2022

[33] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3588–3597

[34] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022

[35] S. Liu, T. Ren, J. Chen, Z. Zeng, H. Zhang, F. Li, H. Li, J. Huang, H. Su, J. Zhu, and L. Zhang, “Detection transformer with stable matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 6491–6500

[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

[37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255

[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

[39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7

CHANHO LEE received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, S...