Recognition: 2 Lean theorem links
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
Pith reviewed 2026-05-12 19:47 UTC · model grok-4.3
The pith
DINO advances DETR-style end-to-end object detection by adding contrastive denoising, mixed query selection, and look-forward-twice box prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DINO improves over previous DETR-like models through contrastive denoising training, mixed query selection for anchor initialization, and a look-forward-twice scheme for box prediction. These yield 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, outperforming DN-DETR, the prior best DETR variant, by 6.0 and 2.7 AP respectively. The model further scales to 63.2 AP on COCO val2017 and 63.3 AP on test-dev after pre-training on Objects365 with a SwinL backbone, while using a smaller model and less pre-training data than competing approaches.
What carries the argument
The central mechanisms are contrastive denoising training, which feeds the decoder both lightly noised (positive) and heavily noised (negative) versions of ground-truth boxes so it learns to reconstruct boxes and reject near-duplicates; mixed query selection, which initializes anchor (positional) queries from top-k encoder proposals while keeping content queries learnable; and the look-forward-twice scheme, which lets the box loss at each decoder layer also refine the preceding layer's prediction.
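As a rough illustration of the contrastive denoising idea above (not the authors' implementation: the jitter form, the λ values, and all function names here are assumptions), positive and negative queries can be sketched in NumPy:

```python
import numpy as np

def jitter_boxes(boxes, lo, hi, rng):
    """Perturb (cx, cy, w, h) boxes with noise whose magnitude lies in [lo, hi)."""
    n = len(boxes)
    def noise():
        return rng.uniform(lo, hi, n) * rng.choice([-1.0, 1.0], n)
    cx, cy, w, h = boxes.T
    return np.stack([cx + noise() * w / 2,
                     cy + noise() * h / 2,
                     w * (1 + noise()),
                     h * (1 + noise())], axis=1)

def contrastive_denoising_queries(gt_boxes, lam1=0.4, lam2=1.0, rng=None):
    """Positive queries (light noise, magnitude below lam1) are trained to
    reconstruct their ground-truth box; negative queries (heavier noise, in
    [lam1, lam2)) are trained to predict 'no object'. The actual method also
    uses label noise, multiple denoising groups, and attention masks that
    block information leakage between groups, all omitted here."""
    rng = rng or np.random.default_rng(0)
    pos = jitter_boxes(gt_boxes, 0.0, lam1, rng)
    neg = jitter_boxes(gt_boxes, lam1, lam2, rng)
    is_positive = np.array([True] * len(pos) + [False] * len(neg))
    return np.concatenate([pos, neg]), is_positive
```

The contrastive pairing is the point: because each negative query sits near, but not on, a real object, the decoder is explicitly pushed to suppress near-duplicate predictions rather than only to refine good ones.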
If this is right
- End-to-end detectors can reach high accuracy with substantially fewer training epochs than earlier transformer-based models.
- The same architecture scales effectively when model size or training data volume increases.
- State-of-the-art COCO results become attainable with smaller model footprints and reduced pre-training data compared to prior leaders.
- Detection pipelines can drop many post-processing steps while still improving accuracy.
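A minimal sketch of the mixed query selection mechanism described above, assuming per-proposal classification scores are available from the encoder; names and shapes are illustrative:

```python
import numpy as np

def mixed_query_selection(enc_scores, enc_boxes, learned_content, k=3):
    """'Mixed' initialization: positional/anchor queries come from the top-k
    scored encoder proposals (dynamic, image-dependent), while content queries
    remain learned embeddings (static, shared across images) rather than being
    taken from encoder features."""
    topk = np.argsort(enc_scores)[::-1][:k]
    anchors = enc_boxes[topk]          # top-k proposals initialize the anchors
    content = learned_content[:k]      # learnable content queries, untouched
    return anchors, content
```

Keeping the content queries learnable while making only the anchors image-dependent is what distinguishes this from pure query selection, where both would come from possibly ambiguous encoder features.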
Where Pith is reading between the lines
- The denoising and query-selection ideas could be tested on other dense-prediction tasks such as instance segmentation or keypoint estimation.
- Faster convergence may reduce the compute cost of adapting the detector to new domains or datasets.
- The emphasis on anchor-box refinement suggests similar mechanisms might help stabilize training in related transformer vision architectures.
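For concreteness, the look-forward-twice refinement can be written out (our reading of the scheme, with Update the iterative box-refinement step and Detach stopping gradients):

```latex
\begin{aligned}
\text{look forward once:}\quad
  & b_i = \mathrm{Update}\!\left(\mathrm{Detach}(b_{i-1}),\, \Delta b_i\right),\\
\text{look forward twice:}\quad
  & b_i^{\mathrm{pred}} = \mathrm{Update}\!\left(b'_{i-1},\, \Delta b_i\right),
    \qquad b_i = \mathrm{Detach}\!\left(b_i^{\mathrm{pred}}\right),
\end{aligned}
```

where $b'_{i-1}$ is the undetached refined box from layer $i-1$: the loss on $b_i^{\mathrm{pred}}$ then also flows back into the previous layer's offset, while the detached $b_i$ is still what gets passed onward.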
Load-bearing premise
The reported accuracy gains are produced by the three proposed techniques rather than by differences in hyper-parameter tuning or other unisolated implementation choices.
What would settle it
An ablation experiment that removes contrastive denoising, mixed query selection, or the look-forward-twice scheme and shows performance no higher than the prior DN-DETR baseline would falsify the claim that these components are decisive.
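The falsification test above amounts to leave-one-out ablations under a fixed training recipe. A minimal sketch of the configuration grid such an experiment would run (component names are from the paper; the function itself is illustrative):

```python
def leave_one_out_configs(components=("contrastive_denoising",
                                      "mixed_query_selection",
                                      "look_forward_twice")):
    """Full model, one variant per removed component, and a DN-DETR-like
    baseline with all three disabled; every variant is meant to share the
    same schedule, augmentations, optimizer, and hyper-parameters."""
    full = {c: True for c in components}
    configs = [("full", dict(full))]
    for c in components:
        variant = dict(full)
        variant[c] = False
        configs.append((f"minus_{c}", variant))
    configs.append(("baseline", {c: False for c in components}))
    return configs
```

If any `minus_*` variant matched the full model, or the baseline matched DN-DETR's reported numbers only after recipe changes, the attribution of the gains to the three components would be undermined.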
read the original abstract
We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0 AP and +2.7 AP, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at https://github.com/IDEACVR/DINO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DINO, a DETR-based end-to-end object detector that incorporates three key improvements: contrastive denoising for training, mixed query selection for anchor initialization, and a look-forward-twice scheme for box prediction. On the COCO dataset with a ResNet-50 backbone, DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs, representing gains of 6.0 AP and 2.7 AP over DN-DETR. It further demonstrates strong scaling behavior, reaching 63.2 AP on COCO val2017 after pre-training on Objects365 with a Swin-L backbone.
Significance. Should the gains be attributable to the proposed architectural and training innovations rather than differences in experimental setup, this work would constitute a meaningful step forward in making DETR-like detectors competitive with traditional two-stage methods in both accuracy and training efficiency. The emphasis on scaling with model and data size, combined with the promise of code release, adds to its potential impact in the computer vision community.
major comments (2)
- [Experiments section] Experiments section (and associated tables comparing to DN-DETR): The paper does not report a controlled re-implementation of the DN-DETR baseline using the exact same training schedule, data augmentations, optimizer, and hyper-parameters as DINO. Given the known sensitivity of DETR-family models to these factors, the reported +6.0 AP (12-epoch) and +2.7 AP (24-epoch) improvements cannot be confidently attributed solely to contrastive denoising, mixed query selection, and look-forward-twice rather than training-recipe differences. A matched re-run of DN-DETR is required to support the central claim.
- [Ablation studies] Ablation studies (e.g., Table 3 or equivalent): The internal ablations compare DINO variants but do not isolate the contribution of each proposed component against a fixed training recipe; this makes it difficult to determine whether the full gain arises from the combination of techniques or from cumulative hyper-parameter adjustments.
minor comments (2)
- [Abstract] The abstract states 'multi-scale features' for the ResNet-50 results; the main text should explicitly confirm whether this matches the multi-scale setup in DN-DETR or introduces additional differences.
- [Figures] Figure captions and method diagrams would benefit from clearer labeling of the three proposed components to aid readers in tracing their implementation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of DINO's contributions. We address each major comment in detail below, providing clarifications and committing to revisions where appropriate to strengthen the experimental evidence.
read point-by-point responses
-
Referee: [Experiments section] Experiments section (and associated tables comparing to DN-DETR): The paper does not report a controlled re-implementation of the DN-DETR baseline using the exact same training schedule, data augmentations, optimizer, and hyper-parameters as DINO. Given the known sensitivity of DETR-family models to these factors, the reported +6.0 AP (12-epoch) and +2.7 AP (24-epoch) improvements cannot be confidently attributed solely to contrastive denoising, mixed query selection, and look-forward-twice rather than training-recipe differences. A matched re-run of DN-DETR is required to support the central claim.
Authors: We agree that controlled re-implementations are essential given the sensitivity of DETR models to training details. DINO was developed by extending the official DN-DETR implementation and adopts the identical training schedule (including 12- and 24-epoch settings), data augmentations, optimizer (AdamW), learning rate schedule, and all other hyper-parameters as specified in the DN-DETR paper and our Section 4.1. To directly address the concern, we will add a matched re-run of DN-DETR under our exact training pipeline to the revised experiments section and tables, which will confirm that the gains originate from the three proposed components rather than recipe differences. revision: yes
-
Referee: [Ablation studies] Ablation studies (e.g., Table 3 or equivalent): The internal ablations compare DINO variants but do not isolate the contribution of each proposed component against a fixed training recipe; this makes it difficult to determine whether the full gain arises from the combination of techniques or from cumulative hyper-parameter adjustments.
Authors: We acknowledge the need for clearer isolation in ablations. Our Table 3 (and related ablation tables) begins from the DN-DETR baseline and adds components one at a time—contrastive denoising, mixed query selection, and look-forward-twice—while holding the training recipe, augmentations, optimizer, and hyper-parameters completely fixed across all variants. We will revise the ablation section and table captions to explicitly state this fixed-recipe protocol and include an additional row showing the cumulative effect, making the contribution of each module unambiguous. revision: yes
Circularity Check
No circularity: empirical results on held-out benchmarks with independent architectural claims
full rationale
The paper proposes three concrete architectural modifications (contrastive denoising training, mixed query selection for anchor initialization, and look-forward-twice box prediction) to the DETR family and reports measured AP on standard COCO val2017/test-dev splits after fixed training schedules. These are not derived quantities; they are direct experimental outcomes. No equation or claim reduces a reported performance number to a fitted constant or to a self-citation by construction. Prior work (DN-DETR) is cited only for baseline comparison; the central claims rest on the new components and their ablations within the DINO implementation, not on any uniqueness theorem or ansatz imported from the authors' earlier papers. Hyper-parameter sensitivity is a validity concern, not a circularity reduction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0 AP and +2.7 AP, respectively, compared to DN-DETR
-
IndisputableMonolith.Foundation.LawOfExistence.law_of_existence (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 30 Pith papers
-
WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning
WD-FQDet decouples modality-shared and modality-specific features in infrared-visible images via wavelet-based frequency decomposition and frequency-aware query selection to achieve state-of-the-art detection performance.
-
Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.
-
Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport
OT-Bridge Editor uses geometrically constrained entropic optimal transport to synthesize CAG images with precise stenosis, improving downstream detection by 27.8% on ARCADE and 23.0% on a multi-center dataset.
-
SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters
SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection...
-
DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.
-
DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
DETR-ViP boosts visual-prompted detection performance by learning globally discriminative prompts through integration and distillation on top of image-text contrastive learning, with a selective fusion step for stability.
-
Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.
-
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
-
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
-
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
-
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
-
Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport
OT-Bridge Editor reframes localized image editing as a constrained entropic optimal transport problem to generate synthetic coronary angiograms that boost downstream stenosis detection by 27.8% on ARCADE and 23.0% on ...
-
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
-
Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness
RefCD enables unsupervised category-aware object detection by using feature similarity between predicted objects and unlabeled reference images to guide category learning.
-
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
VFM⁴SDG uses a frozen vision foundation model to inject cross-domain stability priors into both the encoding and decoding stages of object detectors, reducing missed detections in unseen environments.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
-
ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing
ZoomSpec achieves 78.1 mAP@0.5:0.95 on the SpaceNet dataset by combining log-space STFT, a coarse proposal net, adaptive heterodyne filtering, and dual-domain fine recognition to improve narrowband visibility in wideb...
-
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
-
VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
-
Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection
Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.
-
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.
-
Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection
Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.
-
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
-
FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer
FREE-Switch dynamically switches LoRA adapters using frequency importance per diffusion step and adds semantic alignment to reduce content drift when merging specialized image generators.
-
Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
The approach uses the analytic solution of distribution discrepancy consistency within categories as semantic maps, eliminating training and model-specific modulation while claiming state-of-the-art results on eight b...
-
A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M...
-
OMNI-PoseX: A Fast Vision Model for 6D Object Pose Estimation in Embodied Tasks
OMNI-PoseX presents a unified vision model using open-vocabulary perception and SO(3)-aware reflected flow matching to deliver state-of-the-art 6D pose estimation with real-time performance for embodied tasks.
-
AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes
AMIEOD combines a multi-expert enhancement module with detection-guided regression and selection losses to raise object detection accuracy in low-illumination images.
-
FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection
FSDETR enhances RT-DETR with SHAB, DA-AIFI, and FSFPN blocks to improve small-object detection, reporting 13.9% APS on VisDrone 2019 and 48.95% AP50 on TinyPerson using 14.7M parameters.
-
VGGT-SLAM++
VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.
Reference graph
Works this paper leans on
- [1] Papers with Code. COCO test-dev benchmark (object detection).
- [2] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- [3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
- [4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
- [5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
- [6] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
- [7] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7373–7382, 2021.
- [8] Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, and Lei Zhang. Dynamic DETR: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2988–2997, October 2021.
- [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [11] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of DETR with spatially modulated co-attention. arXiv preprint arXiv:2101.07448, 2021.
- [12] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430, 2021.
- [13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
- [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [15] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. MDETR: Modulated detection for end-to-end multi-modal understanding. arXiv: Computer Vision and Pattern Recognition, 2021.
- [16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [17] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. arXiv preprint arXiv:2203.01305, 2022.
- [18] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. arXiv preprint arXiv:2112.03857, 2021.
- [19] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327, 2020.
- [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [21] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022.
- [22] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin Transformer V2: Scaling up capacity and resolution. arXiv preprint arXiv:2111.09883, 2021.
- [23] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- [24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [25] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. arXiv preprint arXiv:2108.06152, 2021.
- [26] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training, 2018.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
- [28] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
- [29] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
- [30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
- [31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
- [32] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
- [33] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8430–8439, 2019.
- [34] Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris Kitani. Rethinking transformer-based set prediction for object detection. arXiv preprint arXiv:2011.10881, 2020.
- [35] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9627–9636, 2019.
- [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [37] Yingming Wang, Xiangyu Zhang, Tong Yang, and Jian Sun. Anchor DETR: Query design for transformer-based detector. arXiv preprint arXiv:2109.07107, 2021.
- [38] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3060–3069, 2021.
- [39] Zhuyu Yao, Jiangbo Ai, Boxun Li, and Chi Zhang. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318, 2021.
- [40] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- [41] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR 2021: The Ninth International Conference on Learning Representations, 2021.