RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer
In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, introduces a set of bag-of-freebies for flexibility and practicality, and optimizes the training strategy to achieve enhanced performance. To improve flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention, so that the decoder can perform selective multi-scale feature extraction. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator, which RT-DETR requires but YOLO detectors do not. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.
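The abstract does not spell out how the two sampling ideas work, so the following is only a minimal pure-Python sketch of one way to read them, not the authors' implementation. All names (`bilinear_sample`, `discrete_sample`, `aggregate`) are hypothetical, and the rounding convention, the border clamping, and the toy feature maps are assumptions made for illustration. It contrasts interpolated sampling (the behaviour grid_sample provides) with a rounded, interpolation-free lookup, and lets each feature scale use its own number of sampling points.

```python
import math


def bilinear_sample(feat, x, y):
    """Interpolate feat (a list of rows) at continuous pixel coords (x, y).

    Neighbours outside the map contribute zero (zero padding).
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    out = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                out += weight * feat[yi][xi]
    return out


def discrete_sample(feat, x, y):
    """Round to the nearest pixel and read it directly, with no interpolation.

    Assumption: ties round up and out-of-range indices are clamped.
    """
    h, w = len(feat), len(feat[0])
    xi = min(max(math.floor(x + 0.5), 0), w - 1)
    yi = min(max(math.floor(y + 0.5), 0), h - 1)
    return feat[yi][xi]


def aggregate(feats, ref, offsets, weights, sampler):
    """Attention-weighted sum of samples over all scales.

    offsets[l] and weights[l] may have a different length for each scale l,
    which is the "distinct number of sampling points" idea.
    """
    total = 0.0
    for feat, level_offsets, level_weights in zip(feats, offsets, weights):
        h, w = len(feat), len(feat[0])
        for (dx, dy), a in zip(level_offsets, level_weights):
            # Normalized reference point -> pixel coords (pixel centres at +0.5).
            x = ref[0] * w - 0.5 + dx
            y = ref[1] * h - 0.5 + dy
            total += a * sampler(feat, x, y)
    return total


# Toy inputs (hypothetical): a 4x4 map and a 2x2 map.
feats = [
    [[4 * r + c for c in range(4)] for r in range(4)],
    [[2 * r + c for c in range(2)] for r in range(2)],
]
ref = (0.5, 0.5)                                  # normalized reference point
offsets = [[(0.0, 0.0), (0.4, 0.0)], [(0.0, 0.0)]]  # 2 points on scale 0, 1 on scale 1
weights = [[0.25, 0.25], [0.5]]                   # attention weights, summing to 1

smooth = aggregate(feats, ref, offsets, weights, bilinear_sample)
rounded = aggregate(feats, ref, offsets, weights, discrete_sample)
print(smooth, rounded)
```

The two samplers share one call signature, so swapping them changes only the lookup step; the discrete variant needs nothing beyond integer indexing, which is the kind of operation that is straightforward to export to deployment runtimes.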
Forward citations
Cited by 14 Pith papers
-
AdvScene: Rethinking Adversarial Patch Evaluation Through Scene Robustness
AdvScene is a scene-grounded evaluation method using Adversarial Patch-to-Scene Embedding (APSE) to map the operational envelope of physical adversarial patches in reconstructed real environments.
-
ReLeaf: Benchmarking Leaf Segmentation across Domains and Species
A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
-
Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing
SegFS is a dual-path architecture that uses sparse keyframe open-vocabulary predictions to condition a fast feature-space network for efficient temporal instance segmentation in videos.
-
Modular Diffusion Models for Structured Visual Recognition
Modular Diffusion Models decompose diffusion into task-specific modules to model distributions over structured visual outputs for detection, segmentation, and scene graph generation.
-
Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans
Architect-Ant fine-tunes a vision-language model on the new AntPlan-270 dataset using procedural reasoning traces and preference optimization to output editable DSL furniture layouts that can be rendered into images.
-
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors
DeWorldSG improves 3D scene graph generation from RGB-D sequences by using depth-guided 3D Gaussian object nodes and V-JEPA 2 world-model priors for spatiotemporal relation refinement, reporting large recall gains on ...
-
TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors
TinyFormer adds Parallel Bi-fusion Module and Spatial Semantic Adapter to a YOLO-DETR hybrid, raising small-object AP by 1.6 points to 58.5% on MS COCO while keeping real-time speed.
-
ConRTF: Edge-Constrained Boundary Distribution Refinement for Realtime TransFormer Table Structure Recognition
ConRTF adds an edge-constrained fine-grained localization loss to a distribution-based real-time detector to improve boundary accuracy in table structure recognition, claiming up to +1.6 GriTS gains on PubTables-1M wh...
-
RT-SDGOD: Real-Time Single-Domain Generalized Object Detection
RT-SDGDet applies one-to-many supervision, Discriminative Evidence Diversity Learning, and Dual-view Evidence Consistency Learning during training to reduce missed detections in real-time object detectors under unseen...
-
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
YOLO26 presents a unified real-time vision model family with dual-head end-to-end design, new training components, and task-specific heads that reports improved mAP-latency tradeoffs on COCO and LVIS benchmarks across...
-
Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.
-
YOLO26 vs. YOLOv8: A Comprehensive Architectural Benchmark of Next-Generation Real-Time Object Detection Models
Empirical benchmark finds YOLO26 superior on Pascal VOC accuracy and efficiency but YOLOv8 faster on GPU, with both models struggling similarly on VisDrone small-object detection.