pith. machine review for the scientific record.

arxiv: 2203.03605 · v4 · submitted 2022-03-07 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection


Pith reviewed 2026-05-12 19:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords object detection · DETR · end-to-end detection · denoising anchor boxes · contrastive training · COCO benchmark · transformer detectors

The pith

DINO advances DETR-style end-to-end object detection by adding contrastive denoising, mixed query selection, and look-forward-twice box prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DINO as a refinement of prior DETR-like detectors. It replaces standard denoising with a contrastive version, initializes anchors via mixed query selection, and applies a look-forward-twice scheme to refine box predictions. These changes produce faster convergence and higher accuracy on COCO, reaching 49.4 AP after 12 epochs and 51.3 AP after 24 epochs with a ResNet-50 backbone. A sympathetic reader would care because the work moves object detection closer to a fully end-to-end pipeline that avoids the many hand-tuned stages found in traditional detectors.
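As a rough illustration of mixed query selection, the sketch below assumes the behavior described in the DINO paper: anchor boxes are taken from the top-K scoring encoder tokens, while content queries stay as learned embeddings. The function name, argument names, and plain-list data layout are hypothetical and are not taken from the DINO codebase.

```python
def mixed_query_selection(encoder_scores, encoder_boxes, learned_content, k):
    """Pair top-k encoder box proposals with learned content embeddings.

    encoder_scores:  one score per encoder token (higher means more object-like)
    encoder_boxes:   one (cx, cy, w, h) proposal per encoder token
    learned_content: k content embeddings that are learned, not selected
    """
    if len(encoder_scores) != len(encoder_boxes):
        raise ValueError("need one score per box proposal")
    if len(learned_content) != k:
        raise ValueError("need exactly k learned content embeddings")

    # Positional part: indices of the k highest-scoring encoder tokens.
    order = sorted(range(len(encoder_scores)),
                   key=lambda i: encoder_scores[i], reverse=True)
    top_k = order[:k]

    # Content part stays static (learned), so only the anchors depend on the image.
    return [{"anchor": encoder_boxes[i], "content": content}
            for i, content in zip(top_k, learned_content)]
```

Keeping the content embeddings learned, rather than copying them from the selected encoder features, is the "mixed" part: only the positional prior comes from the image.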

Core claim

DINO improves over previous DETR-like models through contrastive denoising training, mixed query selection for anchor initialization, and a look-forward-twice scheme for box prediction. These yield 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with ResNet-50 and multi-scale features, outperforming DN-DETR, the prior best DETR-like model, by 6.0 and 2.7 AP respectively. The model further scales to 63.2 AP on COCO val2017 and 63.3 AP on test-dev after pre-training on Objects365 with a SwinL backbone, while using a smaller model and less pre-training data than competing approaches.
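The look-forward-twice scheme can be pictured as a change in which decoder layers receive gradient from each layer's box loss. The sketch below is only a bookkeeping illustration, not an autograd implementation: it assumes the formulation in the DINO paper, where layer i's predicted box is built from the undetached refined box of layer i-1 plus layer i's offset, and it tracks only the box-refinement path (gradients flowing through content embeddings are ignored). The function names are hypothetical.

```python
def box_gradient_reach(num_layers, look_forward_twice):
    """Map each decoder layer i (1-based) to the set of layers whose box
    offsets receive gradient from the box loss on layer i's prediction."""
    reach = {}
    for i in range(1, num_layers + 1):
        layers = {i}  # the offset predicted by layer i always gets gradient
        if look_forward_twice and i > 1:
            # The previous refined box is used without detaching, so the
            # previous layer's offset is also updated by this loss.
            layers.add(i - 1)
        reach[i] = layers
    return reach


def losses_updating(reach, layer):
    """List the layers whose box loss updates the given layer's offset."""
    return sorted(i for i, layers in reach.items() if layer in layers)
```

Under these assumptions, with look-forward-once each layer's offset is trained only by its own loss, while with look-forward-twice layer 1's offset is trained by the losses of layers 1 and 2.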

What carries the argument

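The central mechanisms are contrastive denoising training, which feeds the decoder both positive (lightly noised) and negative (more heavily noised) versions of ground-truth boxes and trains it to reconstruct the former and reject the latter; mixed query selection, which initializes anchor boxes from the top-K encoder features while keeping the content queries learnable; and the look-forward-twice scheme, which lets the box loss at each decoder layer also update the preceding layer's box prediction.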
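A minimal sketch of how positive and negative denoising queries might be generated, assuming two noise thresholds: positives receive noise below the first threshold and keep their ground-truth target, while negatives receive noise between the two thresholds and are assigned "no object". The function names, the default thresholds, and the exact way noise is applied to the box are illustrative assumptions, not the values or code used by the authors.

```python
import random


def _signed_noise(rng, low, high):
    """Random magnitude in [low, high] with a random sign."""
    return rng.choice((-1.0, 1.0)) * rng.uniform(low, high)


def make_cdn_queries(gt_boxes, lambda1=0.4, lambda2=1.0, seed=0):
    """Build one positive and one negative noised query per ground-truth box.

    gt_boxes: list of (cx, cy, w, h) in normalized coordinates.
    Returns dicts with the noised box and its target: the ground-truth index
    for positives, None ("no object") for negatives.
    """
    rng = random.Random(seed)
    queries = []
    for gt_index, (cx, cy, w, h) in enumerate(gt_boxes):
        # (noise lower bound, noise upper bound, training target)
        for low, high, target in ((0.0, lambda1, gt_index),
                                  (lambda1, lambda2, None)):
            noised = (
                cx + _signed_noise(rng, low, high) * w / 2,  # shift center x
                cy + _signed_noise(rng, low, high) * h / 2,  # shift center y
                max(w * (1 + _signed_noise(rng, low, high)), 1e-4),  # rescale w
                max(h * (1 + _signed_noise(rng, low, high)), 1e-4),  # rescale h
            )
            queries.append({"box": noised, "target": target})
    return queries
```

The negative queries sit near a real object but are labeled as background, which is what is meant to teach the model to reject near-duplicate anchors.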

If this is right

  • End-to-end detectors can reach high accuracy with substantially fewer training epochs than earlier transformer-based models.
  • The same architecture scales effectively when model size or training data volume increases.
  • State-of-the-art COCO results become attainable with smaller model footprints and reduced pre-training data compared to prior leaders.
  • Detection pipelines can drop many post-processing steps while still improving accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The denoising and query-selection ideas could be tested on other dense-prediction tasks such as instance segmentation or keypoint estimation.
  • Faster convergence may reduce the compute cost of adapting the detector to new domains or datasets.
  • The emphasis on anchor-box refinement suggests similar mechanisms might help stabilize training in related transformer vision architectures.

Load-bearing premise

The reported accuracy gains are produced by the three proposed techniques rather than by differences in hyper-parameter tuning or other unisolated implementation choices.

What would settle it

A matched-recipe experiment in which adding contrastive denoising, mixed query selection, and the look-forward-twice scheme to the DN-DETR baseline yields performance no higher than that baseline would falsify the claim that these components are decisive.

read the original abstract

We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0 AP and +2.7 AP, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at https://github.com/IDEACVR/DINO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DINO, a DETR-based end-to-end object detector that incorporates three key improvements: contrastive denoising for training, mixed query selection for anchor initialization, and a look-forward-twice scheme for box prediction. On the COCO dataset with ResNet-50 backbone, DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs, representing gains of 6.0 AP and 2.7 AP over DN-DETR. It further demonstrates strong scaling performance, reaching 63.2 AP on COCO val2017 after pre-training on Objects365 with a Swin-L backbone.

Significance. Should the gains be attributable to the proposed architectural and training innovations rather than differences in experimental setup, this work would constitute a meaningful step forward in making DETR-like detectors competitive with traditional two-stage methods in both accuracy and training efficiency. The emphasis on scaling with model and data size, combined with the promise of code release, adds to its potential impact in the computer vision community.

major comments (2)
  1. [Experiments section] (and associated tables comparing to DN-DETR): The paper does not report a controlled re-implementation of the DN-DETR baseline using the exact same training schedule, data augmentations, optimizer, and hyper-parameters as DINO. Given the known sensitivity of DETR-family models to these factors, the reported +6.0 AP (12-epoch) and +2.7 AP (24-epoch) improvements cannot be confidently attributed solely to contrastive denoising, mixed query selection, and look-forward-twice rather than training-recipe differences. A matched re-run of DN-DETR is required to support the central claim.
  2. [Ablation studies] (e.g., Table 3 or equivalent): The internal ablations compare DINO variants but do not isolate the contribution of each proposed component against a fixed training recipe; this makes it difficult to determine whether the full gain arises from the combination of techniques or from cumulative hyper-parameter adjustments.
minor comments (2)
  1. [Abstract] The abstract states 'multi-scale features' for the ResNet-50 results; the main text should explicitly confirm whether this matches the multi-scale setup in DN-DETR or introduces additional differences.
  2. [Figures] Figure captions and method diagrams would benefit from clearer labeling of the three proposed components to aid readers in tracing their implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of DINO's contributions. We address each major comment in detail below, providing clarifications and committing to revisions where appropriate to strengthen the experimental evidence.

read point-by-point responses
  1. Referee: [Experiments section] (and associated tables comparing to DN-DETR): The paper does not report a controlled re-implementation of the DN-DETR baseline using the exact same training schedule, data augmentations, optimizer, and hyper-parameters as DINO. Given the known sensitivity of DETR-family models to these factors, the reported +6.0 AP (12-epoch) and +2.7 AP (24-epoch) improvements cannot be confidently attributed solely to contrastive denoising, mixed query selection, and look-forward-twice rather than training-recipe differences. A matched re-run of DN-DETR is required to support the central claim.

    Authors: We agree that controlled re-implementations are essential given the sensitivity of DETR models to training details. DINO was developed by extending the official DN-DETR implementation and adopts the identical training schedule (including 12- and 24-epoch settings), data augmentations, optimizer (AdamW), learning rate schedule, and all other hyper-parameters as specified in the DN-DETR paper and our Section 4.1. To directly address the concern, we will add a matched re-run of DN-DETR under our exact training pipeline to the revised experiments section and tables, which will confirm that the gains originate from the three proposed components rather than recipe differences.

    revision: yes

  2. Referee: [Ablation studies] (e.g., Table 3 or equivalent): The internal ablations compare DINO variants but do not isolate the contribution of each proposed component against a fixed training recipe; this makes it difficult to determine whether the full gain arises from the combination of techniques or from cumulative hyper-parameter adjustments.

    Authors: We acknowledge the need for clearer isolation in ablations. Our Table 3 (and related ablation tables) begins from the DN-DETR baseline and adds components one at a time (contrastive denoising, mixed query selection, and look-forward-twice) while holding the training recipe, augmentations, optimizer, and hyper-parameters completely fixed across all variants. We will revise the ablation section and table captions to explicitly state this fixed-recipe protocol and include an additional row showing the cumulative effect, making the contribution of each module unambiguous.

    revision: yes
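The fixed-recipe protocol discussed in the exchange above can be made concrete as a list of runs that differ only in which components are enabled. This is a hypothetical sketch: the dictionary keys, run names, and the specific recipe entries (taken from settings mentioned on this page: AdamW, 12 epochs, ResNet-50) are illustrative and do not correspond to any configuration file in the DINO or DN-DETR repositories.

```python
# Illustrative recipe shared by every run; keys are hypothetical.
FIXED_RECIPE = {"optimizer": "AdamW", "epochs": 12, "backbone": "ResNet-50"}

# Components in the order they are added on top of the baseline.
COMPONENTS = ("contrastive_denoising", "mixed_query_selection", "look_forward_twice")


def cumulative_ablation(components=COMPONENTS, recipe=FIXED_RECIPE):
    """Return one run per ablation row: the baseline, then each component
    added cumulatively, all sharing an identical copy of the recipe."""
    runs = []
    for count in range(len(components) + 1):
        enabled = components[:count]
        name = "baseline" if count == 0 else "+" + enabled[-1]
        runs.append({"name": name, "enabled": enabled, "recipe": dict(recipe)})
    return runs


if __name__ == "__main__":
    for run in cumulative_ablation():
        print(run["name"], run["enabled"])
```

Because every run carries the same recipe, any AP difference between adjacent rows can be attributed to the single component that was added, which is the property the referee asks for.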

Circularity Check

0 steps flagged

No circularity: empirical results on held-out benchmarks with independent architectural claims

full rationale

The paper proposes three concrete architectural modifications (contrastive denoising training, mixed query selection for anchor initialization, and look-forward-twice box prediction) to the DETR family and reports measured AP on standard COCO val2017/test-dev splits after fixed training schedules. These are not derived quantities; they are direct experimental outcomes. No equation or claim reduces a reported performance number to a fitted constant or to a self-citation by construction. Prior work (DN-DETR) is cited only for baseline comparison; the central claims rest on the new components and their ablations within the DINO implementation, not on any uniqueness theorem or ansatz imported from the authors' earlier papers. Hyper-parameter sensitivity is a validity concern, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical engineering contribution. It introduces no new mathematical axioms, free parameters beyond standard training hyper-parameters, or invented physical entities.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0 AP and +2.7 AP, respectively, compared to DN-DETR

  • IndisputableMonolith.Foundation.LawOfExistence law_of_existence unclear

    Relation between the paper passage and the cited Recognition theorem.

    DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    WD-FQDet decouples modality-shared and modality-specific features in infrared-visible images via wavelet-based frequency decomposition and frequency-aware query selection to achieve state-of-the-art detection performance.

  2. Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.

  3. Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

    cs.CV 2026-05 unverdicted novelty 7.0

    OT-Bridge Editor uses geometrically constrained entropic optimal transport to synthesize CAG images with precise stenosis, improving downstream detection by 27.8% on ARCADE and 23.0% on a multi-center dataset.

  4. SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    cs.CV 2026-05 unverdicted novelty 7.0

    SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection...

  5. DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.

  6. DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

    cs.CV 2026-04 unverdicted novelty 7.0

    DETR-ViP boosts visual-prompted detection performance by learning globally discriminative prompts through integration and distillation on top of image-text contrastive learning, with a selective fusion step for stability.

  7. Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

    cs.CL 2026-04 unverdicted novelty 7.0

    DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.

  8. WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

    cs.CV 2026-04 unverdicted novelty 7.0

    WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

  9. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV 2026-04 unverdicted novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

  10. SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.

  11. SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

  12. Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

    cs.CV 2026-05 unverdicted novelty 6.0

    OT-Bridge Editor reframes localized image editing as a constrained entropic optimal transport problem to generate synthetic coronary angiograms that boost downstream stenosis detection by 27.8% on ARCADE and 23.0% on ...

  13. Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.

  14. Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness

    cs.CV 2026-05 unverdicted novelty 6.0

    RefCD enables unsupervised category-aware object detection by using feature similarity between predicted objects and unlabeled reference images to guide category learning.

  15. VFM⁴SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    VFM⁴SDG uses a frozen vision foundation model to inject cross-domain stability priors into both the encoding and decoding stages of object detectors, reducing missed detections in unseen environments.

  16. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  17. ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    ZoomSpec achieves 78.1 mAP@0.5:0.95 on the SpaceNet dataset by combining log-space STFT, a coarse proposal net, adaptive heterodyne filtering, and dual-domain fine recognition to improve narrowband visibility in wideb...

  18. Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

    cs.CV 2026-04 unverdicted novelty 6.0

    VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.

  19. VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

  20. Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.

  21. PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

    cs.CV 2026-04 unverdicted novelty 6.0

    PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.

  22. Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.

  23. SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

    cs.CV 2026-04 unverdicted novelty 5.0

    SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

  24. FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer

    cs.CV 2026-04 unverdicted novelty 5.0

    FREE-Switch dynamically switches LoRA adapters using frequency importance per diffusion step and adds semantic alignment to reduce content drift when merging specialized image generators.

  25. Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    The approach uses the analytic solution of distribution discrepancy consistency within categories as semantic maps, eliminating training and model-specific modulation while claiming state-of-the-art results on eight b...

  26. A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures

    cs.CV 2026-04 unverdicted novelty 5.0

    WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M...

  27. OMNI-PoseX: A Fast Vision Model for 6D Object Pose Estimation in Embodied Tasks

    cs.RO 2026-04 unverdicted novelty 5.0

    OMNI-PoseX presents a unified vision model using open-vocabulary perception and SO(3)-aware reflected flow matching to deliver state-of-the-art 6D pose estimation with real-time performance for embodied tasks.

  28. AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes

    cs.CV 2026-05 unverdicted novelty 4.0

    AMIEOD combines a multi-expert enhancement module with detection-guided regression and selection losses to raise object detection accuracy in low-illumination images.

  29. FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    FSDETR enhances RT-DETR with SHAB, DA-AIFI, and FSFPN blocks to improve small-object detection, reporting 13.9% APS on VisDrone 2019 and 48.95% AP50 on TinyPerson using 14.7M parameters.

  30. VGGT-SLAM++

    cs.CV 2026-04 unverdicted novelty 4.0

    VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 29 Pith papers · 7 internal anchors

  1. [1]

    Papers with code - coco test-dev benchmark (object detection)

  2. [2]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020

  3. [3]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020

  4. [4]

    Hybrid task cascade for instance segmentation

    Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019

  5. [5]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019

  6. [6]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  7. [7]

    Dynamic head: Unifying object detection heads with attentions

    Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7373–7382, 2021

  8. [8]

    Dynamic detr: End-to-end object detection with dynamic attention

    Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, and Lei Zhang. Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2988–2997, October 2021

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  10. [10]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  11. [11]

    Fast convergence of detr with spatially modulated co-attention

    Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of detr with spatially modulated co-attention. arXiv preprint arXiv:2101.07448, 2021

  12. [12]

    YOLOX: Exceeding YOLO Series in 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021

  13. [13]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

  14. [14]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

  15. [15]

    Mdetr – modulated detection for end-to-end multi-modal understanding

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. Mdetr – modulated detection for end-to-end multi-modal understanding. arXiv: Computer Vision and Pattern Recognition, 2021

  16. [16]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  17. [17]

    Dn-detr: Accelerate detr training by introducing query denoising

    Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. arXiv preprint arXiv:2203.01305, 2022

  18. [18]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. arXiv preprint arXiv:2112.03857, 2021

  19. [19]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327, 2020

  20. [20]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  21. [21]

    Dab-detr: Dynamic anchor boxes are better queries for detr

    Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022

  22. [22]

    Swin transformer v2: Scaling up capacity and resolution, 2022

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. arXiv preprint arXiv:2111.09883, 2021

  23. [23]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021

  24. [24]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  25. [25]

    Conditional detr for fast training convergence

    Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. arXiv preprint arXiv:2108.06152, 2021

  26. [26]

    Mixed precision training, 2018

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training, 2018

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021

  28. [28]

    Yolo9000: better, faster, stronger

    Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017

  29. [29]

    YOLOv3: An Incremental Improvement

    Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018

  30. [30]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015

  31. [31]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017

  32. [32]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019

  33. [33]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019. 18 H. Zhang, F. Li, S. Liu, et al

  34. [34]

    Rethinking transformer-based set prediction for object detection

    Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris Kitani. Rethinking transformer-based set prediction for object detection. arXiv preprint arXiv:2011.10881, 2020

  35. [35]

    Fcos: Fully convolutional one-stage object detection

    Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9627–9636, 2019

  36. [36]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017

  37. [37]

    Anchor detr: Query design for transformer-based detector

    Yingming Wang, Xiangyu Zhang, Tong Yang, and Jian Sun. Anchor detr: Query design for transformer-based detector. arXiv preprint arXiv:2109.07107, 2021

  38. [38]

    End-to-end semi-supervised object detection with soft teacher

    Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3060–3069, 2021

  39. [39]

    Efficient detr: improving end-to-end object detector with dense prior

    Zhuyu Yao, Jiangbo Ai, Boxun Li, and Chi Zhang. Efficient detr: Improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318, 2021

  40. [40]

    Florence: A new foundation model for computer vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

  41. [41]

    Deformable detr: Deformable transformers for end-to-end object detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR 2021: The Ninth International Conference on Learning Representations, 2021.

    A Test Time Augmentations (TTA)

    We aim to build an end-to-end detector th...

  42. [42]

    The results demonstrate that our models are not only effective but also efficient for training

    All results are reported on 8 Nvidia A100 GPUs with ResNet-50 [14]. The results demonstrate that our models are not only effective but also efficient for training. Model Batch Size per GPU Training Time GPU Mem. Epoch AP Faster RCNN [30]* 8 ∼ 60min/ep 13GB 108 42.0 DETR [3] 8 ∼ 16min/ep 26GB 300 41.2 Deformable DETR [41] ...

  43. [43]

    relu” batch norm type “FrozenBatchNorm2d

    adopts 300 decoder queries and 3 patterns [37], we use 300×3 = 900 decoder queries with the same computation cost. Learning schedules of our DINO with SwinL are available in the appendix. Loss function. We use the L1 loss and GIOU [32] loss for box regression and focal loss [19] with α = 0.25, γ = 2 for classification. As in DETR [3], we add auxiliary los...
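
The loss terms named in the excerpt above (L1 and GIoU for box regression, focal loss with α = 0.25, γ = 2 for classification) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the (x1, y1, x2, y2) corner box format, and the unreduced per-element form are choices made here for clarity. The relative weights of the three terms are not given in this excerpt, so none are applied.

```python
import math


def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single logit; target is 0.0 or 1.0.

    alpha and gamma default to the values quoted in the text.
    """
    # Numerically stable sigmoid.
    if logit >= 0:
        p = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p = e / (1.0 + e)
    # Numerically stable binary cross-entropy computed from the logit.
    ce = max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
    # p_t is the probability the model assigns to the true class.
    p_t = p * target + (1.0 - p) * (1.0 - target)
    # alpha weights positives, (1 - alpha) weights negatives.
    alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
    # (1 - p_t) ** gamma down-weights examples that are already easy.
    return alpha_t * (1.0 - p_t) ** gamma * ce


def l1_loss(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def giou_loss(box_a, box_b):
    """1 - GIoU for two boxes in (x1, y1, x2, y2) format (assumed here).

    Both boxes must have positive area (x2 > x1 and y2 > y1).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou
```

For two disjoint unit boxes such as `(0, 0, 1, 1)` and `(2, 0, 3, 1)`, the IoU is zero but `giou_loss` still returns 4/3 rather than a constant, which is why GIoU provides a usable gradient signal when the predicted and target boxes do not overlap.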