Evaluation of Convolutional and Transformer-Based Detectors for Weed Detection in Tomato Plantations
Pith reviewed 2026-05-09 20:04 UTC · model grok-4.3
The pith
CNN-based detectors deliver comparable weed detection accuracy to transformers but at far lower computational cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the GROUNDBASED_WEED dataset, YOLOv26-nano achieves high precision, recall, and average precision at lower computational cost, whereas RT-DETR and RF-DETR provide stronger global context modeling at the expense of higher resource demands, establishing a practical trade-off between efficiency and contextual capability for automated weed detection.
What carries the argument
Side-by-side benchmarking of a CNN detector (YOLOv26-nano) and two transformer detectors (RT-DETR Large, RF-DETR Medium) using precision, recall, average precision, and inference speed on the GROUNDBASED_WEED dataset.
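To make the accuracy metrics concrete, here is a minimal sketch of precision and recall for one image at a fixed IoU threshold. It is not the paper's evaluation code: the function names, the 0.5 threshold, the greedy highest-score-first matching, and the single-class simplification are all assumptions made for illustration. Average precision additionally integrates precision over recall across score thresholds, which this sketch does not do.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Tuple[Box, float]],  # (box, confidence score)
    truths: List[Box],
    iou_thr: float = 0.5,  # assumed threshold, not taken from the paper
) -> Tuple[float, float]:
    """Greedy matching: each ground-truth box is matched at most once."""
    matched = set()
    tp = 0
    # Visit predictions from most to least confident.
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(truths):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp   # predictions with no matching ground truth
    fn = len(truths) - tp  # ground-truth boxes that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    return precision, recall
```

With two ground-truth boxes and two predictions, of which only one overlaps a ground-truth box, this returns a precision of 0.5 and a recall of 0.5.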
If this is right
- CNN-based detectors are the practical choice for real-time, resource-limited weed detection in precision agriculture.
- Transformer-based detectors become preferable only when global scene context outweighs speed and hardware constraints.
- Developers can use the measured speed-accuracy numbers to set hardware requirements for tractor-mounted systems.
- Model selection guidelines derived from this trade-off apply directly to other early-stage crop monitoring tasks.
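Turning speed measurements into hardware requirements depends on how latency is measured. The sketch below shows one common pattern: warm-up iterations followed by timed runs, reported as median latency, 95th-percentile latency, and frames per second. The function name `measure_latency`, the warm-up and run counts, and the stand-in workload are hypothetical; the paper's own timing procedure is not described here. When the model runs on a GPU, the callable would also need to wait for the device to finish, since GPU work is typically launched asynchronously.

```python
import math
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    infer: Callable[[], object],
    warmup: int = 5,   # assumed: untimed runs to reach a steady state
    runs: int = 30,    # assumed: number of timed runs
) -> Dict[str, float]:
    """Time a zero-argument inference callable and summarize the latency."""
    for _ in range(warmup):
        infer()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    median_ms = statistics.median(samples_ms)
    # Nearest-rank 95th percentile.
    p95_ms = samples_ms[max(0, math.ceil(0.95 * len(samples_ms)) - 1)]
    fps = 1000.0 / median_ms if median_ms > 0 else float("inf")
    return {"median_ms": median_ms, "p95_ms": p95_ms, "fps": fps}


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs one image through a model.
    print(measure_latency(lambda: sum(i * i for i in range(10_000))))
```

Reporting the 95th percentile alongside the median matters for tractor-mounted systems, where occasional slow frames can matter more than the average.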
Where Pith is reading between the lines
- Hybrid architectures that combine fast local feature extraction with selective global attention could narrow the observed gap.
- Performance differences may shift under varying lighting, crop densities, or weed species not heavily represented in the current dataset.
- Edge-device deployment in mobile farm equipment would amplify the CNN advantage through reduced power draw and latency.
Load-bearing premise
The GROUNDBASED_WEED dataset adequately represents realistic early-weed conditions and the selected models fairly stand for the convolutional and transformer approaches.
What would settle it
Repeating the evaluation on a new, more varied set of field images and finding that the transformer models match or exceed the CNN speed-accuracy balance would refute the claimed efficiency advantage.
Original abstract
This paper presents a comparative evaluation of convolutional and transformer-based object detection architectures for early weed detection in tomato plantations. Representative models from each paradigm are considered, including YOLOv26-nano, a recent variant of the YOLO family, and RT-DETR Large and RF-DETR Medium as transformer-based architectures. The evaluation was conducted on the GROUNDBASED_WEED dataset, considering six weed classes and an additional category corresponding to unidentified plants, which allowed for the assessment of performance in terms of detection accuracy and computational efficiency using metrics such as precision, recall, average precision, and inference speed, as well as non-parametric statistical tests. The results highlight a clear trade-off between efficiency and contextual modeling: CNN-based detectors achieve high performance at a lower computational cost, while transformer-based approaches offer better global context capture at the expense of higher resource demands. These results provide practical criteria for model selection in precision agriculture applications.
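The abstract mentions non-parametric statistical tests without naming them. As one illustration of that family, the sketch below implements an exact paired sign-flip permutation test on per-run or per-class scores from two models. The choice of test, the function name, and the example values are assumptions, not the authors' procedure.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_test(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact p-value for the null hypothesis that the paired
    differences a[i] - b[i] are symmetric around zero.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not results from the paper.
    model_a = [0.80, 0.82, 0.79, 0.81, 0.83]
    model_b = [0.70, 0.72, 0.71, 0.69, 0.73]
    print(paired_sign_flip_test(model_a, model_b))  # 2/32 = 0.0625
```

With only five pairs the smallest achievable two-sided p-value is 2/32, which shows why the number of runs or classes limits what such a test can establish.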
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper performs a comparative evaluation of a convolutional detector (YOLOv26-nano) and two transformer-based detectors (RT-DETR and RF-DETR) for automated weed detection in precision agriculture. Using the GROUNDBASED_WEED dataset, it measures performance via precision, recall, average precision (AP), and inference speed, concluding that there is a trade-off: CNN detectors provide high performance at lower computational cost, while transformer models offer superior global context modeling at the cost of higher resource requirements. These findings are intended to guide model selection in agricultural applications.
Significance. If substantiated, the results offer practical insights into selecting detection architectures for weed management in precision farming, highlighting efficiency versus accuracy trade-offs. The work is empirical and avoids circular reasoning by relying on direct measurements. However, the absence of detailed experimental protocols and dataset characterizations reduces confidence in the generalizability of the trade-off claim to real-world early-weed detection scenarios.
major comments (2)
- [Abstract] The central claim regarding a 'clear trade-off between efficiency and contextual modeling' depends on the GROUNDBASED_WEED dataset adequately representing realistic early-weed conditions (small size, high occlusion, crop similarity). However, no dataset statistics on object scale distribution, background complexity, or comparisons to established weed detection benchmarks are provided, preventing attribution of results to paradigm differences rather than dataset artifacts.
- [Experimental Setup] The manuscript reports standard metrics but omits critical details on dataset splits (e.g., train/test ratios), training protocols, hyperparameter choices, number of runs for averaging, or statistical tests for comparing model performances. This lack of information makes it impossible to verify the soundness of the comparative results or reproduce the experiments.
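The object scale distribution requested in the first major comment can be summarized with a few lines of code. The sketch below buckets bounding boxes by pixel area using the COCO-style convention (small below 32×32, medium below 96×96, large otherwise). Applying that convention to GROUNDBASED_WEED, and the function name `scale_distribution`, are assumptions; the thresholds only make sense at the resolution the boxes are measured in.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

SMALL_MAX_AREA = 32 ** 2   # COCO-style boundary between small and medium
MEDIUM_MAX_AREA = 96 ** 2  # COCO-style boundary between medium and large


def scale_distribution(boxes_wh: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Return the fraction of boxes in each size bucket.

    boxes_wh: (width, height) of each bounding box, in pixels.
    """
    counts: Counter = Counter()
    for width, height in boxes_wh:
        area = width * height
        if area < SMALL_MAX_AREA:
            counts["small"] += 1
        elif area < MEDIUM_MAX_AREA:
            counts["medium"] += 1
        else:
            counts["large"] += 1
    total = sum(counts.values())
    return {
        name: (counts[name] / total if total else 0.0)
        for name in ("small", "medium", "large")
    }
```

A dataset dominated by the small bucket would support the claim that it reflects early-stage weeds; a dataset dominated by large boxes would weaken it.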
minor comments (1)
- [Abstract] There is a formatting issue in the dataset name: 'GROUNDBASED_ WEED' contains an apparent space after the underscore, which should be corrected to 'GROUNDBASED_WEED'.
Simulated Author's Rebuttal
We sincerely thank the referee for their constructive feedback, which highlights important areas for improving the reproducibility and generalizability of our comparative study. We have addressed both major comments by committing to substantial additions in the revised manuscript, including new dataset analyses and detailed experimental protocols. These changes will better support our claims about the trade-off between efficiency and contextual modeling while maintaining the empirical focus of the work.
Point-by-point responses
- Referee: [Abstract] The central claim regarding a 'clear trade-off between efficiency and contextual modeling' depends on the GROUNDBASED_WEED dataset adequately representing realistic early-weed conditions (small size, high occlusion, crop similarity). However, no dataset statistics on object scale distribution, background complexity, or comparisons to established weed detection benchmarks are provided, preventing attribution of results to paradigm differences rather than dataset artifacts.
Authors: We agree that dataset characterization is essential to attribute performance differences to architectural paradigms rather than dataset-specific properties. In the revised manuscript, we will add a new subsection under 'Dataset Description' that includes: quantitative object scale distributions (histograms and statistics of bounding box areas, confirming emphasis on small objects typical of early-weed stages); background complexity metrics (e.g., average image entropy, vegetation density variance, and occlusion rates estimated via overlap analysis); and explicit comparisons to established benchmarks such as DeepWeeds and WeedNet in terms of scale, occlusion, and crop-weed visual similarity. These will be accompanied by additional figures and tables. This will strengthen the justification for the observed trade-off where CNN-based models like YOLOv26-nano achieve high accuracy at lower cost compared to transformer models. revision: yes
- Referee: [Experimental Setup] The manuscript reports standard metrics but omits critical details on dataset splits (e.g., train/test ratios), training protocols, hyperparameter choices, number of runs for averaging, or statistical tests for comparing model performances. This lack of information makes it impossible to verify the soundness of the comparative results or reproduce the experiments.
Authors: We acknowledge this omission and apologize for not including these details in the initial submission, despite having followed a rigorous internal protocol. The revised 'Experimental Setup' section will be expanded to specify: dataset splits (80% train, 10% validation, 10% test with stratification by weed species and growth stage); complete training protocols (Adam optimizer, initial learning rate 0.001 with cosine decay, 150 epochs, specific data augmentations including random rotation, flip, and brightness adjustment); all hyperparameter choices (batch size 16, input size 640x640, weight decay 0.0005); results averaged over 5 independent runs with reported means and standard deviations; and statistical comparisons using paired t-tests to evaluate significance of differences in precision, recall, AP, and inference speed between models. A reproducibility checklist will be added as supplementary material. These revisions will enable full verification and reproduction of the experiments. revision: yes
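Two items from the protocol in the authors' response can be sketched briefly: the stratified 80/10/10 split and the cosine learning-rate decay from 0.001 over 150 epochs. This is an illustration under assumptions, not the authors' code. The function names, the fixed seed, a single stratum label per image (for example a `(species, growth_stage)` tuple), a minimum learning rate of 0, and the absence of warm-up are all hypothetical choices.

```python
import math
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,  # assumed: fixed seed so the split is repeatable
) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices into train/val/test, stratum by stratum."""
    rng = random.Random(seed)
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


def cosine_lr(
    epoch: int,
    total_epochs: int = 150,
    lr_init: float = 0.001,
    lr_min: float = 0.0,  # assumed: the response does not state a floor
) -> float:
    """Cosine decay from lr_init at epoch 0 to lr_min at total_epochs."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))
```

Detection images often contain several classes at once, so assigning one stratum per image is itself a simplification that a reproducibility checklist would need to spell out.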
Circularity Check
No significant circularity: purely empirical model comparison
Full rationale
The paper performs a direct experimental comparison of CNN and transformer detectors on the GROUNDBASED_WEED dataset, reporting measured metrics (precision, recall, AP, inference speed) without derivations, fitted parameters renamed as predictions, or load-bearing self-citations. The central trade-off claim follows immediately from the tabulated results; no step reduces to its own inputs by construction. This is the expected outcome for an empirical benchmark study.