RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer

Guanzhong Wang; Kui Huang; Qinyao Chang; Wenyu Lv; Yian Zhao; Yi Liu

arxiv: 2407.17140 · v1 · pith:YZUAWBTNnew · submitted 2024-07-24 · 💻 cs.CV

Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu

classification 💻 cs.CV

keywords rt-detr · real-time · rt-detrv2 · achieve · bag-of-freebies · detection · flexibility · improve

In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and opens up a set of bag-of-freebies for flexibility and practicality, as well as optimizing the training strategy to achieve enhanced performance. To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention to achieve selective multi-scale feature extraction by the decoder. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator that is specific to RT-DETR compared to YOLOs. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameters customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.
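
The contrast between interpolated and discrete sampling mentioned in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the operator, so the sketch assumes that "discrete sampling" means reading the single nearest feature-map cell, whereas a grid_sample-style operator blends the four surrounding cells bilinearly. The function names `bilinear_sample` and `discrete_sample`, the pixel-coordinate convention, and the clamping rule are all hypothetical.

```python
import math


def bilinear_sample(feat, x, y):
    """Blend the four cells around continuous pixel coords (x, y).

    `feat` is a list of rows; cells outside the map contribute zero.
    This mimics the interpolation a grid_sample-style operator performs.
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0  # fractional offsets inside the cell
    value = 0.0
    for yi, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xi, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= xi < w and 0 <= yi < h:
                value += wx * wy * feat[yi][xi]
    return value


def discrete_sample(feat, x, y):
    """Assumed discrete variant: round to the nearest cell and index it.

    Only integer indexing is needed, so no interpolation operator is
    required at deployment time. Coordinates are clamped to the map.
    """
    h, w = len(feat), len(feat[0])
    xi = min(max(math.floor(x + 0.5), 0), w - 1)
    yi = min(max(math.floor(y + 0.5), 0), h - 1)
    return feat[yi][xi]


if __name__ == "__main__":
    feature_map = [[0.0, 1.0], [2.0, 3.0]]
    print(bilinear_sample(feature_map, 0.5, 0.5))
    print(discrete_sample(feature_map, 0.6, 0.2))
```

The design trade-off in this sketch is that rounding replaces four weighted reads with one plain index lookup, at the cost of sub-pixel precision; rounding is also not differentiable with respect to the sampling location, which any training setup using such an operator would need to account for.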

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AdvScene: Rethinking Adversarial Patch Evaluation Through Scene Robustness
cs.CR 2026-05 unverdicted novelty 7.0

AdvScene is a scene-grounded evaluation method using Adversarial Patch-to-Scene Embedding (APSE) to map the operational envelope of physical adversarial patches in reconstructed real environments.
ReLeaf: Benchmarking Leaf Segmentation across Domains and Species
cs.CV 2026-05 unverdicted novelty 7.0

A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing
cs.CV 2026-06 unverdicted novelty 6.0

SegFS is a dual-path architecture that uses sparse keyframe open-vocabulary predictions to condition a fast feature-space network for efficient temporal instance segmentation in videos.
Modular Diffusion Models for Structured Visual Recognition
cs.CV 2026-06 unverdicted novelty 6.0

Modular Diffusion Models decompose diffusion into task-specific modules to model distributions over structured visual outputs for detection, segmentation, and scene graph generation.
Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans
cs.AI 2026-06 unverdicted novelty 6.0

Architect-Ant fine-tunes a vision-language model on the new AntPlan-270 dataset using procedural reasoning traces and preference optimization to output editable DSL furniture layouts that can be rendered into images.
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
cs.CV 2026-04 unverdicted novelty 6.0

VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors
cs.CV 2026-07 unverdicted novelty 5.0

DeWorldSG improves 3D scene graph generation from RGB-D sequences by using depth-guided 3D Gaussian object nodes and V-JEPA 2 world-model priors for spatiotemporal relation refinement, reporting large recall gains on ...
TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors
cs.CV 2026-05 unverdicted novelty 5.0

TinyFormer adds Parallel Bi-fusion Module and Spatial Semantic Adapter to a YOLO-DETR hybrid, raising small-object AP by 1.6 points to 58.5% on MS COCO while keeping real-time speed.
ConRTF: Edge-Constrained Boundary Distribution Refinement for Realtime TransFormer Table Structure Recognition
cs.CV 2026-07 unverdicted novelty 4.0

ConRTF adds an edge-constrained fine-grained localization loss to a distribution-based real-time detector to improve boundary accuracy in table structure recognition, claiming up to +1.6 GriTS gains on PubTables-1M wh...
RT-SDGOD: Real-Time Single-Domain Generalized Object Detection
cs.CV 2026-06 unverdicted novelty 4.0

RT-SDGDet applies one-to-many supervision, Discriminative Evidence Diversity Learning, and Dual-view Evidence Consistency Learning during training to reduce missed detections in real-time object detectors under unseen...
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
cs.CV 2026-06 unverdicted novelty 4.0

YOLO26 presents a unified real-time vision model family with dual-head end-to-end design, new training components, and task-specific heads that reports improved mAP-latency tradeoffs on COCO and LVIS benchmarks across...
Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
cs.CV 2026-04 unverdicted novelty 4.0

YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.
YOLO26 vs. YOLOv8: A Comprehensive Architectural Benchmark of Next-Generation Real-Time Object Detection Models
cs.CV 2026-05 unverdicted novelty 2.0

Empirical benchmark finds YOLO26 superior on Pascal VOC accuracy and efficiency but YOLOv8 faster on GPU, with both models struggling similarly on VisDrone small-object detection.