Towards Robust Deep Learning-based Rumex Obtusifolius Detection from Drone Images
Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3
The pith
Vision Transformers pretrained with self-supervised learning generalize from ground to drone images for Rumex weed classification better than domain-adapted convolutional networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that Vision Transformer models pretrained with the DINOv2 and DINOv3 self-supervised objectives handle the domain shift from ground-vehicle source data to UAV target data for Rumex classification without explicit adaptation. After fine-tuning on the source data alone, these models reach F1 scores of around 0.8 on the target dataset, exceeding ResNet models even when those are trained with moment-matching or maximum classifier discrepancy domain adaptation. The authors attribute this to the general-purpose representations learned during large-scale pretraining.
What carries the argument
Self-supervised pretrained Vision Transformers that acquire rich, general-purpose representations during large-scale pretraining, enabling intrinsic robustness to domain shifts in image classification tasks.
If this is right
- Pretrained ViTs can achieve high target domain performance by fine-tuning only on the source dataset without additional domain adaptation training.
- Established domain adaptation techniques like moment matching provide less benefit when using ViTs compared to CNNs.
- The released AGSMultiRumex dataset supports further studies on domain adaptation for weed detection in grassland environments.
- Self-supervised pretraining may offer a general advantage for handling distribution shifts in remote sensing applications.
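For readers unfamiliar with moment matching, the idea can be sketched as a penalty on the difference between feature statistics of source and target batches. The following is a minimal illustration only, not the authors' implementation: the function names, the use of raw moments up to order 2, and the toy feature values are all assumptions made for this example.

```python
from statistics import fmean


def feature_moments(features, order):
    """Per-dimension raw moment of the given order for a batch of feature vectors."""
    dims = len(features[0])
    return [fmean(row[d] ** order for row in features) for d in range(dims)]


def moment_discrepancy(source, target, max_order=2):
    """Sum, over orders 1..max_order, of the Euclidean distance between
    the per-dimension moments of the source and target batches."""
    total = 0.0
    for order in range(1, max_order + 1):
        src = feature_moments(source, order)
        tgt = feature_moments(target, order)
        total += sum((a - b) ** 2 for a, b in zip(src, tgt)) ** 0.5
    return total


# Toy 2-D features (hypothetical values, not taken from the paper).
source_feats = [[0.0, 1.0], [2.0, 3.0]]
target_feats = [[1.0, 1.0], [3.0, 3.0]]

identical = moment_discrepancy(source_feats, source_feats)  # 0.0
shifted = moment_discrepancy(source_feats, target_feats)    # 4.0
```

In a training loop, a term of this kind would typically be added to the classification loss so that the feature extractor is pushed toward statistics shared by both domains.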
Where Pith is reading between the lines
- Large-scale self-supervised pretraining on internet-scale image data may provide a foundation that makes many vision models more transferable across different capture conditions without custom adaptation.
- Adopting these ViT models could simplify the development of drone-based monitoring tools for farmers, potentially enabling more frequent and accurate weed assessments in variable field conditions.
- Testing these models on data from different regions, altitudes, or lighting would reveal how broadly the observed robustness applies beyond the Swiss meadow dataset.
- Similar benefits might appear in other agricultural computer vision tasks involving shifts from close-up to overhead views.
Load-bearing premise
The better performance of the ViTs stems specifically from their self-supervised pretraining rather than differences in model architecture capacity, fine-tuning details, or characteristics of the particular datasets used.
What would settle it
Training comparable Vision Transformer models from random initialization or with only supervised pretraining on the same source data and checking whether their F1 score on the UAV target dataset falls below that of the moment-matching ResNets.
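The comparison described above reduces to computing F1 on the target set for each model and checking the ordering. A minimal sketch, assuming binary labels with 1 marking Rumex; the label lists are toy values invented for illustration and are not results from the paper, and the helper names are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def falls_below(control_f1, baseline_f1):
    """True if the control model scores strictly below the baseline."""
    return control_f1 < baseline_f1


# Toy target-domain labels and predictions (hypothetical).
y_true = [1, 1, 1, 0, 0, 1]
control_pred = [1, 1, 0, 0, 1, 1]   # e.g. a ViT trained from random initialization
baseline_pred = [1, 1, 1, 0, 0, 1]  # e.g. a moment-matching ResNet

control_f1 = f1_score(y_true, control_pred)    # 0.75
baseline_f1 = f1_score(y_true, baseline_pred)  # 1.0
settled = falls_below(control_f1, baseline_f1)
```

A real test would repeat this over several random seeds, since a single F1 difference on one target set may not be significant.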
Original abstract
Domain adaptation (DA) addresses the challenge of transferring a machine learning model trained on a source domain to a target domain with a different data distribution. In this work, we study DA for the task of Rumex obtusifolius (Rumex) image classification. We train models on a published, ground vehicle-based dataset (source) and evaluate their performance on a custom target dataset acquired by unmanned aerial vehicles (UAVs). We find that Convolutional Neural Network (CNN) models, specifically ResNets, generalize poorly to the target domain, even after fine-tuning on the source data. Applying moment-matching and maximum classifier discrepancy, two established DA techniques, substantially improves target-domain performance. However, Vision Transformer (ViT) models pretrained with self-supervised objectives (DINOv2, DINOv3) handle domain shifts intrinsically well, surpassing even moment-matching-trained ResNets, likely due to the rich, general-purpose representations acquired during large-scale pretraining. Using ViTs fine-tuned on the source dataset, we demonstrate high classification performances in the range of F1=0.8 on our target dataset. To support further research on DA for weed detection in grassland systems, we publicly release our UAV-based target dataset AGSMultiRumex, comprising data from 15 flights over Swiss meadows.
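The second DA technique named in the abstract, maximum classifier discrepancy, is generally described as using two classifier heads on a shared feature extractor and measuring how much their predictions disagree on unlabeled target samples. A minimal sketch of that discrepancy measure follows, assuming it is the mean absolute difference between the two heads' softmax outputs; the function names and logits are illustrative, and the paper's exact training setup is not given here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_discrepancy(logits_a, logits_b):
    """Mean absolute difference between two classifiers' class
    probabilities, averaged over classes and over the batch."""
    per_sample = []
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        per_sample.append(sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa))
    return sum(per_sample) / len(per_sample)


# Toy two-class logits for a one-sample batch (hypothetical values).
agree = classifier_discrepancy([[2.0, -1.0]], [[2.0, -1.0]])
disagree = classifier_discrepancy([[10.0, -10.0]], [[-10.0, 10.0]])
```

In the adversarial scheme this measure supports, the classifier heads are trained to increase the discrepancy on target data while the feature extractor is trained to reduce it.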
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates domain adaptation for Rumex obtusifolius classification in UAV imagery. Models are trained on a published ground-vehicle source dataset and evaluated on a new target dataset collected via 15 UAV flights over Swiss meadows. ResNet CNNs generalize poorly to the target domain even after fine-tuning; established DA methods (moment-matching, maximum classifier discrepancy) improve performance. Self-supervised pretrained Vision Transformers (DINOv2, DINOv3) achieve higher target-domain results (F1 around 0.8) without explicit adaptation, which the authors attribute to rich representations from large-scale pretraining. The AGSMultiRumex target dataset is released publicly to support further research.
Significance. If the performance attribution holds after controls, the work indicates that large-scale self-supervised pretraining on Vision Transformers can yield representations robust to source-to-UAV domain shifts in agricultural settings, potentially simplifying deployment compared with explicit DA pipelines. The public release of the AGSMultiRumex dataset is a clear strength, enabling reproducibility and community follow-up on weed detection in grassland systems.
major comments (2)
- [Abstract and results description] The central claim that DINOv2/DINOv3-pretrained ViTs handle domain shifts intrinsically well because of self-supervised pretraining is load-bearing but unsecured. No ablation studies compare against supervised-pretrained ViTs of matched size, randomly initialized ViTs, or parameter-matched CNN baselines, so it remains unclear whether the reported F1≈0.8 gains arise from the pretraining objective, architectural differences, or model capacity.
- [Target dataset section] The AGSMultiRumex dataset is described only at a high level (15 flights over Swiss meadows) without statistics on image resolution, altitude/viewpoint distributions, weed density, or lighting conditions. These details are necessary to evaluate whether dataset characteristics independently favor transformer attention patterns and to assess broader representativeness of the UAV conditions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects that will improve the clarity and rigor of our work on domain adaptation for Rumex detection. We address each major comment below and outline the corresponding revisions.
- Referee: [Abstract and results description] The central claim that DINOv2/DINOv3-pretrained ViTs handle domain shifts intrinsically well because of self-supervised pretraining is load-bearing but unsecured. No ablation studies compare against supervised-pretrained ViTs of matched size, randomly initialized ViTs, or parameter-matched CNN baselines, so it remains unclear whether the reported F1≈0.8 gains arise from the pretraining objective, architectural differences, or model capacity.
  Authors: We acknowledge that the manuscript does not include the suggested ablations (supervised-pretrained ViTs, randomly initialized ViTs, or parameter-matched CNNs), which limits our ability to isolate the exact contribution of the self-supervised pretraining objective versus architecture or capacity. Our primary evidence is the consistent outperformance of the DINO-pretrained ViTs over both standard and domain-adapted ResNets on the target UAV domain. We will revise the abstract and results sections to temper the attribution language, explicitly noting that gains are observed relative to the ResNet baselines, and add a limitations discussion paragraph addressing the potential roles of pretraining type, architecture, and model scale. This revision will be made without new experiments, as the current results already demonstrate practical utility for the UAV task.
  Revision: partial
- Referee: [Target dataset section] The AGSMultiRumex dataset is described only at a high level (15 flights over Swiss meadows) without statistics on image resolution, altitude/viewpoint distributions, weed density, or lighting conditions. These details are necessary to evaluate whether dataset characteristics independently favor transformer attention patterns and to assess broader representativeness of the UAV conditions.
  Authors: We agree that additional dataset statistics are needed for proper evaluation of representativeness and potential biases. We will expand the Target Dataset section with quantitative details including: mean and range of image resolutions, histograms of flight altitudes and camera viewpoints, average weed density (plants per image) with standard deviation, and qualitative/quantitative notes on lighting variations across the 15 flights. These additions will be supported by the raw metadata from the UAV collection campaign.
  Revision: yes
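The statistics promised in the second response could be derived from per-image metadata along the following lines. This is a sketch under assumptions: the record fields (`flight`, `altitude_m`, `plants`), the 10 m bin width, and all values are hypothetical and do not describe the actual AGSMultiRumex metadata.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical per-image metadata records (invented for illustration).
records = [
    {"flight": 1, "altitude_m": 10.0, "plants": 2},
    {"flight": 1, "altitude_m": 12.0, "plants": 0},
    {"flight": 2, "altitude_m": 21.0, "plants": 4},
    {"flight": 2, "altitude_m": 19.0, "plants": 2},
]


def density_summary(rows):
    """Mean and sample standard deviation of plants per image."""
    counts = [r["plants"] for r in rows]
    return mean(counts), stdev(counts)


def altitude_histogram(rows, bin_width_m=10.0):
    """Count images per altitude bin, keyed by the bin's lower edge in metres."""
    bins = Counter(
        int(r["altitude_m"] // bin_width_m) * bin_width_m for r in rows
    )
    return dict(sorted(bins.items()))


avg_plants, sd_plants = density_summary(records)
alt_hist = altitude_histogram(records)
```

Reporting such summaries per flight as well as overall would show whether a few flights dominate the target distribution.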
Circularity Check
No circularity; purely empirical evaluation of pretrained models on held-out UAV data
The manuscript reports measured F1 scores from fine-tuning published ViT and ResNet models on a source dataset and evaluating on a custom held-out UAV target dataset (AGSMultiRumex). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance differences are presented as direct experimental outcomes rather than constructed from the inputs by definition.