Evaluation of Convolutional and Transformer-Based Detectors for Weed Detection in Tomato Plantations

Alcides Toledo Espinosa; \'Angel Eduardo Zamora-Su\'arez; Gerardo Antonio \'Alvarez Hern\'andez; Juan Irving V\'asquez; Miguel Bola\~nos

arxiv: 2605.00908 · v2 · pith:HCPJE4IAnew · submitted 2026-04-29 · 💻 cs.CV

Evaluation of Convolutional and Transformer-Based Detectors for Weed Detection in Tomato Plantations

Alcides Toledo Espinosa , Gerardo Antonio \'Alvarez Hern\'andez , \'Angel Eduardo Zamora-Su\'arez , Miguel Bola\~nos , Juan Irving V\'asquez This is my paper

Pith reviewed 2026-05-09 20:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords weed detectionobject detectionconvolutional neural networkstransformersprecision agricultureYOLORTDETRcomparative evaluation

0 comments

The pith

CNN-based detectors deliver comparable weed detection accuracy to transformers but at far lower computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates representative convolutional and transformer object detectors on early weed spotting in farm images. It runs YOLOv26-nano alongside RTDETR and RF-DETR on the GROUNDBASED_WEED dataset and records both detection quality and inference speed. A consistent pattern appears: the convolutional model reaches high precision and recall while running substantially faster. The transformer models capture broader scene context but demand more processing power. The comparison supplies direct guidance on selecting models for real-time precision agriculture systems.

Core claim

On the GROUNDBASED_WEED dataset, YOLOv26-nano achieves high precision, recall, and average precision at lower computational cost, whereas RTDETR and RF-DETR provide stronger global context modeling at the expense of higher resource demands, establishing a practical trade-off between efficiency and contextual capability for automated weed detection.

What carries the argument

Side-by-side benchmarking of a CNN detector (YOLOv26-nano) and two transformer detectors (RTDETR, RF-DETR) using precision, recall, average precision, and inference speed on the GROUNDBASED_WEED dataset.

If this is right

CNN-based detectors are the practical choice for real-time, resource-limited weed detection in precision agriculture.
Transformer-based detectors become preferable only when global scene context outweighs speed and hardware constraints.
Developers can use the measured speed-accuracy numbers to set hardware requirements for tractor-mounted systems.
Model selection guidelines derived from this trade-off apply directly to other early-stage crop monitoring tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid architectures that combine fast local feature extraction with selective global attention could narrow the observed gap.
Performance differences may shift under varying lighting, crop densities, or weed species not heavily represented in the current dataset.
Edge-device deployment in mobile farm equipment would amplify the CNN advantage through reduced power draw and latency.

Load-bearing premise

The GROUNDBASED_WEED dataset adequately represents realistic early-weed conditions and the selected models fairly stand for the convolutional and transformer approaches.

What would settle it

Repeating the evaluation on a new, more varied set of field images and finding that the transformer models match or exceed the CNN speed-accuracy balance would refute the claimed efficiency advantage.

Figures

Figures reproduced from arXiv: 2605.00908 by Alcides Toledo Espinosa, \'Angel Eduardo Zamora-Su\'arez, Gerardo Antonio \'Alvarez Hern\'andez, Juan Irving V\'asquez, Miguel Bola\~nos.

**Figure 1.** Figure 1: Distribution of annotated instances in the GROUNDBASED_WEED dataset, highlighting class imbalance and ambiguous samples. These characteristics make GROUNDBASED_WEED a realistic benchmark for evaluating model robustness and generalization, bridging the gap between controlled experiments and real-world deployment [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 2.** Figure 2: Example of YOLOv26-nano inference trained on the GROUNDBASED_WEED dataset It is concluded that hierarchical CNN optimization provides a more robust approach for real-time precision agriculture applications, particularly for handling small, dispersed objects. Consequently, YOLOv26-nano emerges as the most viable architecture for integration into embedded robotic systems operating under real-world field c… view at source ↗

read the original abstract

This paper presents a comparative evaluation of convolutional and transformer-based object detection architectures for early weed detection in tomato plantations. Representative models from each paradigm are considered, including YOLOv26-nano, a recent variant of the YOLO family, and RT-DETR Large and RF-DETR Medium as transformer-based architectures. The evaluation was conducted on the GROUNDBASED_WEED dataset, considering six weed classes and an additional category corresponding to unidentified plants, which allowed for the assessment of performance in terms of detection accuracy and computational efficiency using metrics such as precision, recall, average precision, and inference speed, as well as non-parametric statistical tests. The results highlight a clear trade-off between efficiency and contextual modeling: CNN-based detectors achieve high performance at a lower computational cost, while transformer-based approaches offer better global context capture at the expense of higher resource demands. These results provide practical criteria for model selection in precision agriculture applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Routine benchmark comparing a few existing detectors on one weed dataset, with no new method or insight.

read the letter

This paper runs YOLOv26-nano against RTDETR and RF-DETR on the GROUNDBASED_WEED dataset and reports the usual precision, recall, AP, and speed numbers. The central observation is the expected trade-off: the CNN version is faster with competitive accuracy while the transformer models pick up more context at higher cost. That is basically the whole contribution. Nothing new in architecture, loss, or training approach appears here. It simply applies established detectors to one domain-specific task. The practical numbers on efficiency versus accuracy could help an engineer picking a model for a tractor-mounted camera, and the authors do lay out the metrics clearly enough for that narrow use. Beyond that, the work stays within standard empirical comparison. The soft spots are straightforward and fairly large. The entire claim about the trade-off depends on the dataset actually reflecting early-weed conditions with small objects, occlusion, and field variability, yet no scale histograms, background statistics, or comparison to other weed corpora are provided. Without those checks the performance gap could easily be an artifact of this particular collection rather than a general CNN-versus-transformer property. Training splits, hyperparameter details, and any statistical tests are also absent, so the numbers are hard to reproduce or generalize. This is the kind of paper that might interest a small group of precision-agriculture practitioners who need quick model-selection guidance for similar camera setups. A reader already working on weed or crop detection might scan the tables for ballpark figures. It will not interest anyone looking for advances in object detection itself. I would not bring it to a reading group. It lacks the depth or novelty to merit serious referee time in computer vision venues. The authors would need to add multiple datasets, scale analysis, and error breakdowns before it could hold up under review.

Referee Report

2 major / 1 minor

Summary. This paper performs a comparative evaluation of convolutional neural network-based (specifically YOLOv26-nano) and transformer-based (RTDETR and RF-DETR) object detection models for the task of automated weed detection in precision agriculture. Using the GROUNDBASED_WEED dataset, it measures performance via precision, recall, average precision (AP), and inference speed, concluding that there is a trade-off where CNN detectors provide high performance at lower computational cost while transformer models offer superior global context modeling at the cost of higher resource requirements. These findings are intended to guide model selection in agricultural applications.

Significance. If substantiated, the results offer practical insights into selecting detection architectures for weed management in precision farming, highlighting efficiency versus accuracy trade-offs. The work is empirical and avoids circular reasoning by relying on direct measurements. However, the absence of detailed experimental protocols and dataset characterizations reduces confidence in the generalizability of the trade-off claim to real-world early-weed detection scenarios.

major comments (2)

[Abstract] The central claim regarding a 'clear trade-off between efficiency and contextual modeling' depends on the GROUNDBASED_WEED dataset adequately representing realistic early-weed conditions (small size, high occlusion, crop similarity). However, no dataset statistics on object scale distribution, background complexity, or comparisons to established weed detection benchmarks are provided, preventing attribution of results to paradigm differences rather than dataset artifacts.
[Experimental Setup] The manuscript reports standard metrics but omits critical details on dataset splits (e.g., train/test ratios), training protocols, hyperparameter choices, number of runs for averaging, or statistical tests for comparing model performances. This lack of information makes it impossible to verify the soundness of the comparative results or reproduce the experiments.

minor comments (1)

[Abstract] There is a formatting issue in the dataset name: 'GROUNDBASED_ WEED' contains an apparent space after the underscore, which should be corrected to 'GROUNDBASED_WEED'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their constructive feedback, which highlights important areas for improving the reproducibility and generalizability of our comparative study. We have addressed both major comments by committing to substantial additions in the revised manuscript, including new dataset analyses and detailed experimental protocols. These changes will better support our claims about efficiency-contextual modeling trade-offs while maintaining the empirical focus of the work.

read point-by-point responses

Referee: [Abstract] The central claim regarding a 'clear trade-off between efficiency and contextual modeling' depends on the GROUNDBASED_WEED dataset adequately representing realistic early-weed conditions (small size, high occlusion, crop similarity). However, no dataset statistics on object scale distribution, background complexity, or comparisons to established weed detection benchmarks are provided, preventing attribution of results to paradigm differences rather than dataset artifacts.

Authors: We agree that dataset characterization is essential to attribute performance differences to architectural paradigms rather than dataset-specific properties. In the revised manuscript, we will add a new subsection under 'Dataset Description' that includes: quantitative object scale distributions (histograms and statistics of bounding box areas, confirming emphasis on small objects typical of early-weed stages); background complexity metrics (e.g., average image entropy, vegetation density variance, and occlusion rates estimated via overlap analysis); and explicit comparisons to established benchmarks such as DeepWeeds and WeedNet in terms of scale, occlusion, and crop-weed visual similarity. These will be accompanied by additional figures and tables. This will strengthen the justification for the observed trade-off where CNN-based models like YOLOv26-nano achieve high accuracy at lower cost compared to transformer models. revision: yes
Referee: [Experimental Setup] The manuscript reports standard metrics but omits critical details on dataset splits (e.g., train/test ratios), training protocols, hyperparameter choices, number of runs for averaging, or statistical tests for comparing model performances. This lack of information makes it impossible to verify the soundness of the comparative results or reproduce the experiments.

Authors: We acknowledge this omission and apologize for not including these details in the initial submission, despite having followed a rigorous internal protocol. The revised 'Experimental Setup' section will be expanded to specify: dataset splits (80% train, 10% validation, 10% test with stratification by weed species and growth stage); complete training protocols (Adam optimizer, initial learning rate 0.001 with cosine decay, 150 epochs, specific data augmentations including random rotation, flip, and brightness adjustment); all hyperparameter choices (batch size 16, input size 640x640, weight decay 0.0005); results averaged over 5 independent runs with reported means and standard deviations; and statistical comparisons using paired t-tests to evaluate significance of differences in precision, recall, AP, and inference speed between models. A reproducibility checklist will be added as supplementary material. These revisions will enable full verification and reproduction of the experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical model comparison

full rationale

The paper performs a direct experimental comparison of CNN and transformer detectors on the GROUNDBASED_WEED dataset, reporting measured metrics (precision, recall, AP, inference speed) without derivations, fitted parameters renamed as predictions, or load-bearing self-citations. The central trade-off claim follows immediately from the tabulated results; no step reduces to its own inputs by construction. This is the expected outcome for an empirical benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study that relies on standard machine-learning evaluation practices and publicly known model architectures. No new free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5458 in / 1027 out tokens · 49913 ms · 2026-05-09T20:04:18.160831+00:00 · methodology

Evaluation of Convolutional and Transformer-Based Detectors for Weed Detection in Tomato Plantations

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)