RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Ajay Sharda; Manoj Karkee; Rahul Harsha Cheppally; Ranjan Sapkota

arxiv: 2504.13099 · v1 · pith:RUYPTNH6new · submitted 2025-04-17 · 💻 cs.CV

keywords: detection, rf-detr, object, yolov12, greenfruits, model, single-class, complex

This study presents a detailed comparison of the RF-DETR base object detection model and YOLOv12 model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed with both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. RF-DETR, which uses a DINOv2 backbone and deformable attention, excelled in global context modeling and effectively identified partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, which optimizes it for computational efficiency and edge deployment. In single-class detection, RF-DETR achieved the highest mean Average Precision at an IoU threshold of 0.50 (mAP@50), 0.9464, showing a superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed it in complex spatial scenarios. In multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings, where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited to fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
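
The abstract reports two metrics, mAP@50 and mAP@50:95, and a different model leads on each. The sketch below is not the authors' evaluation code. It only illustrates, assuming the common COCO-style convention in which mAP@50:95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05, why a detection can count as correct at the loose 0.50 threshold yet fail at stricter ones. The box coordinates and the helper names `iou` and `thresholds_passed` are hypothetical, chosen for illustration.

```python
# Illustrative sketch only: box values and helper names are assumptions,
# not taken from the paper.

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95 (assumed convention).
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred, truth):
    """Return the IoU thresholds at which `pred` counts as a match for `truth`."""
    score = iou(pred, truth)
    return [t for t in THRESHOLDS if score >= t]

truth = (0, 0, 100, 100)
loose = (10, 10, 110, 110)   # shifted by 10 px: roughly localized
tight = (2, 2, 102, 102)     # shifted by 2 px: tightly localized

print(len(thresholds_passed(loose, truth)), "of", len(THRESHOLDS))
print(len(thresholds_passed(tight, truth)), "of", len(THRESHOLDS))
```

Both hypothetical predictions count as hits at IoU 0.50, but the loosely localized box stops matching at the stricter thresholds. This is one way a model can lead on mAP@50 while another leads on mAP@50:95.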

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Transformer-Based Source Detection and Morphological Classification in LOFAR Deep-Field Continuum Images
astro-ph.IM 2026-05 unverdicted novelty 5.0

RF-DETR trained on ELAIS-N1 achieves ~91% F1 for detection and morphology classification on LOFAR images, generalizes to other fields, recovers most PyBDSF sources as single entities rather than fragmented components,...

Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge
cs.CV 2026-06 unverdicted novelty 4.0

MIDOG 2025 challenge shows top mitosis detection F1 of 0.740 and atypical figure balanced accuracy of 0.908 across diverse tumors, with clear drops in challenging regions and tumor-type variation.