Towards Robust Deep Learning-based Rumex Obtusifolius Detection from Drone Images

Fabian Dionys Schrag; Konrad Schindler; Mehmet Ozgur Turkoglu; Ralph Lukas Stoop

arxiv: 2604.25316 · v1 · submitted 2026-04-28 · 💻 cs.CV

Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords Rumex obtusifolius detection · domain adaptation · Vision Transformers · UAV imagery · self-supervised learning · weed detection · drone images · deep learning


The pith

Vision Transformers pretrained with self-supervised learning generalize from ground to drone images for Rumex weed classification better than domain-adapted convolutional networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how well deep learning models trained on ground-based images of Rumex obtusifolius can be applied to images taken from drones over meadows. It shows that typical CNNs like ResNets fail to transfer well even after fine-tuning, but applying domain adaptation methods improves results. In contrast, Vision Transformers that were pretrained using self-supervised techniques on large datasets perform strongly on the new drone data without needing those adaptation steps, reaching useful accuracy levels. This finding points to a simpler way to build reliable systems for mapping weeds in agriculture using aerial imagery, and the authors provide a new public dataset from 15 drone flights to support more work in this area.

Core claim

The authors demonstrate that Vision Transformer models pretrained with the DINOv2 and DINOv3 self-supervised objectives inherently manage the domain shift from ground-vehicle source data to UAV target data for Rumex classification. After fine-tuning on the source data alone, these models reach F1 scores around 0.8 on the target data, exceeding ResNet models even when those are enhanced with moment-matching or maximum classifier discrepancy domain adaptation. The authors attribute this to the general-purpose representations learned during pretraining.
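For readers less familiar with the metric behind the "F1 around 0.8" figure, the following is a minimal sketch of how a binary F1 score is computed from predictions. It assumes a binary Rumex / no-Rumex labelling and a single positive class; the text above does not state which averaging the paper uses, and the function name `f1_score` and the toy labels are illustrative, not taken from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the class `positive` (assumed here: image contains Rumex)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels chosen for easy arithmetic; NOT data from the paper.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(y_true, y_pred)
```

With four true positives, one false positive, and one false negative, precision and recall are both 0.8, so the toy example also lands at an F1 of 0.8.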

What carries the argument

Self-supervised pretrained Vision Transformers that acquire rich, general-purpose representations during large-scale pretraining, enabling intrinsic robustness to domain shifts in image classification tasks.

If this is right

Pretrained ViTs can achieve high target domain performance by fine-tuning only on the source dataset without additional domain adaptation training.
Established domain adaptation techniques like moment matching provide less benefit when using ViTs compared to CNNs.
The released AGSMultiRumex dataset supports further studies on domain adaptation for weed detection in grassland environments.
Self-supervised pretraining may offer a general advantage for handling distribution shifts in remote sensing applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Large-scale self-supervised pretraining on internet-scale image data may provide a foundation that makes many vision models more transferable across different capture conditions without custom adaptation.
Adopting these ViT models could simplify the development of drone-based monitoring tools for farmers, potentially enabling more frequent and accurate weed assessments in variable field conditions.
Testing these models on data from different regions, altitudes, or lighting would reveal how broadly the observed robustness applies beyond the Swiss meadow dataset.
Similar benefits might appear in other agricultural computer vision tasks involving shifts from close-up to overhead views.

Load-bearing premise

The better performance of the ViTs stems specifically from their self-supervised pretraining rather than differences in model architecture capacity, fine-tuning details, or characteristics of the particular datasets used.

What would settle it

Training comparable Vision Transformer models from random initialization or with only supervised pretraining on the same source data and checking whether their F1 score on the UAV target dataset falls below that of the moment-matching ResNets.
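The comparison described above can be written down as a small decision rule. This is only a sketch of the logic: the variant names, the helper functions, and every number below are hypothetical placeholders, since the paper reports no such ablation.

```python
def variants_below_baseline(variant_f1, baseline_f1):
    """Names of ViT variants whose target-domain F1 falls below the baseline."""
    return sorted(name for name, f1 in variant_f1.items() if f1 < baseline_f1)


def supports_pretraining_attribution(variant_f1, baseline_f1, ssl_variant):
    """True if only the self-supervised variant stays at or above the baseline."""
    below = set(variants_below_baseline(variant_f1, baseline_f1))
    controls = set(variant_f1) - {ssl_variant}
    return ssl_variant not in below and controls <= below


# PLACEHOLDER numbers, invented purely to exercise the functions.
# They are NOT results from the paper.
placeholder_f1 = {
    "vit_self_supervised": 0.80,
    "vit_supervised_pretrained": 0.60,
    "vit_random_init": 0.40,
}
placeholder_baseline = 0.70  # stands in for the moment-matching ResNet

below = variants_below_baseline(placeholder_f1, placeholder_baseline)
verdict = supports_pretraining_attribution(
    placeholder_f1, placeholder_baseline, "vit_self_supervised"
)
```

If the control variants did not fall below the baseline in a real experiment, the rule would return `False`, pointing to architecture or capacity rather than the pretraining objective.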

Figures

Figures reproduced from arXiv: 2604.25316 by Fabian Dionys Schrag, Konrad Schindler, Mehmet Ozgur Turkoglu, Ralph Lukas Stoop.

**Figure 1.** Experimental setup with ground-robot-based source domain (left) used for train...

**Figure 2.** Example demonstrating the translation from the original dataset used for object...

**Figure 3.** Label distribution of the Rumex classification task in source and target data.

**Figure 4.** Model performance on AGSMultiRumex vs. number of trainable parameters for...

**Figure 5.** F1 score for each flight for compared methods.

Original abstract

Domain adaptation (DA) addresses the challenge of transferring a machine learning model trained on a source domain to a target domain with a different data distribution. In this work, we study DA for the task of Rumex obtusifolius (Rumex) image classification. We train models on a published, ground vehicle-based dataset (source) and evaluate their performance on a custom target dataset acquired by unmanned aerial vehicles (UAVs). We find that Convolutional Neural Network (CNN) models, specifically ResNets, generalize poorly to the target domain, even after fine-tuning on the source data. Applying moment-matching and maximum classifier discrepancy, two established DA techniques, substantially improves target-domain performance. However, Vision Transformer (ViT) models pretrained with self-supervised objectives (DINOv2, DINOv3) handle domain shifts intrinsically well, surpassing even moment-matching-trained ResNets, likely due to the rich, general-purpose representations acquired during large-scale pretraining. Using ViTs fine-tuned on the source dataset, we demonstrate high classification performances in the range of F1=0.8 on our target dataset. To support further research on DA for weed detection in grassland systems, we publicly release our UAV-based target dataset AGSMultiRumex, comprising data from 15 flights over Swiss meadows.
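To make the moment-matching idea in the abstract concrete, here is a minimal sketch of a moment-distance penalty between source and target feature batches. It assumes a single source domain and matches only the first two raw moments per feature dimension; the paper's exact loss, weighting, and feature extractor are not given in this text, and the function names are illustrative.

```python
def raw_moment(batch, order):
    """Per-dimension raw moment E[x^order] over a batch of feature vectors."""
    n = len(batch)
    dims = len(batch[0])
    return [sum(vec[d] ** order for vec in batch) / n for d in range(dims)]


def l2_distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def moment_matching_loss(source, target, orders=(1, 2)):
    """Sum of L2 gaps between source and target moments (simplified sketch)."""
    return sum(
        l2_distance(raw_moment(source, k), raw_moment(target, k)) for k in orders
    )


# Toy 2-D "features"; real inputs would be backbone embeddings.
source = [[0.0, 1.0], [2.0, 3.0]]
target = [[1.0, 2.0], [3.0, 4.0]]
loss = moment_matching_loss(source, target)
```

In training, a penalty of this kind is added to the supervised source loss so that the feature extractor is pushed toward embeddings whose statistics agree across the two domains.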

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper releases a new public UAV dataset for Rumex detection and reports that pretrained ViTs reach F1~0.8 on it after source-only fine-tuning, beating adapted ResNets, but the attribution to self-supervised pretraining lacks the needed controls.


The main contribution is the AGSMultiRumex dataset from 15 UAV flights over Swiss meadows, released publicly, along with the direct comparison showing DINO-pretrained ViTs handling the shift from ground-vehicle source images better than ResNets even after moment-matching or maximum classifier discrepancy adaptation. They fine-tune the ViTs only on the source and still get solid target performance around F1=0.8, which is a practical result for weed mapping work. Releasing the dataset is the clearest value here, since it gives others a real aerial target set to test against instead of relying on synthetic shifts or the same old benchmarks. The experiments stay focused on an actual agriculture problem and the numbers are presented head-to-head without overclaiming novelty in the methods themselves.

The soft spot is the missing controls on why the ViTs win. The paper credits the self-supervised pretraining for rich representations, but ViT models are typically larger than the ResNet baselines and have different inductive biases that could suit overhead views or varying scales independently of pretraining. There are no ablations on supervised-pretrained ViTs of matched size, randomly initialized ViTs, or parameter-matched CNNs, so capacity or architecture could explain the gap as easily as the DINO objective. The target dataset description stays high-level, leaving open whether resolution, viewpoint distribution, or weed density patterns happen to favor attention mechanisms.

This is for people doing drone-based weed detection or domain adaptation in agriculture who need a new real-world target set. A reader working on practical UAV applications would get usable numbers and data to build from.

I would send it for peer review because the dataset release and the empirical comparison are concrete enough to be worth referee time, even if the explanation for the ViT advantage needs tightening with ablations.

Referee Report

2 major / 0 minor

Summary. The manuscript investigates domain adaptation for Rumex obtusifolius classification in UAV imagery. Models are trained on a published ground-vehicle source dataset and evaluated on a new target dataset collected via 15 UAV flights over Swiss meadows. ResNet CNNs generalize poorly to the target domain even after fine-tuning; established DA methods (moment-matching, maximum classifier discrepancy) improve performance. Self-supervised pretrained Vision Transformers (DINOv2, DINOv3) achieve higher target-domain results (F1 around 0.8) without explicit adaptation, which the authors attribute to rich representations from large-scale pretraining. The AGSMultiRumex target dataset is released publicly to support further research.
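The second DA method named in the summary, maximum classifier discrepancy, rests on a disagreement measure between two classifier heads that share one feature extractor. The sketch below shows that measure in its commonly used form, the mean absolute difference between the two heads' softmax outputs. The paper's configuration is not described in this text, so the function names and toy logits are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_discrepancy(logits_a, logits_b):
    """Mean absolute gap between two heads' class probabilities.

    Averaged over all samples and classes in the batch.
    """
    total, count = 0.0, 0
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        total += sum(abs(a - b) for a, b in zip(pa, pb))
        count += len(pa)
    return total / count


# Toy logits for one unlabeled target sample and two classes.
head_a = [[0.0, 0.0]]     # undecided head: probabilities [0.5, 0.5]
head_b = [[20.0, -20.0]]  # confident head: probabilities close to [1, 0]
gap = classifier_discrepancy(head_a, head_b)
```

In the standard formulation, the two heads are trained to maximize this gap on unlabeled target samples, and the feature extractor is then trained to minimize it, which pulls target features away from the decision boundaries.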

Significance. If the performance attribution holds after controls, the work indicates that large-scale self-supervised pretraining on Vision Transformers can yield representations robust to source-to-UAV domain shifts in agricultural settings, potentially simplifying deployment compared with explicit DA pipelines. The public release of the AGSMultiRumex dataset is a clear strength, enabling reproducibility and community follow-up on weed detection in grassland systems.

major comments (2)

[Abstract and results description] The central claim that DINOv2/DINOv3-pretrained ViTs handle domain shifts intrinsically well because of self-supervised pretraining is load-bearing but unsecured. No ablation studies compare against supervised-pretrained ViTs of matched size, randomly initialized ViTs, or parameter-matched CNN baselines, so it remains unclear whether the reported F1≈0.8 gains arise from the pretraining objective, architectural differences, or model capacity.
[Target dataset section] The AGSMultiRumex dataset is described only at high level (15 flights over Swiss meadows) without statistics on image resolution, altitude/viewpoint distributions, weed density, or lighting conditions. These details are necessary to evaluate whether dataset characteristics independently favor transformer attention patterns and to assess broader representativeness of the UAV conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects that will improve the clarity and rigor of our work on domain adaptation for Rumex detection. We address each major comment below and outline the corresponding revisions.


Referee: [Abstract and results description] The central claim that DINOv2/DINOv3-pretrained ViTs handle domain shifts intrinsically well because of self-supervised pretraining is load-bearing but unsecured. No ablation studies compare against supervised-pretrained ViTs of matched size, randomly initialized ViTs, or parameter-matched CNN baselines, so it remains unclear whether the reported F1≈0.8 gains arise from the pretraining objective, architectural differences, or model capacity.

Authors: We acknowledge that the manuscript does not include the suggested ablations (supervised-pretrained ViTs, randomly initialized ViTs, or parameter-matched CNNs), which limits our ability to isolate the exact contribution of the self-supervised pretraining objective versus architecture or capacity. Our primary evidence is the consistent outperformance of the DINO-pretrained ViTs over both standard and domain-adapted ResNets on the target UAV domain. We will revise the abstract and results sections to temper the attribution language, explicitly noting that gains are observed relative to the ResNet baselines, and add a limitations discussion paragraph addressing the potential roles of pretraining type, architecture, and model scale. This revision will be made without new experiments, as the current results already demonstrate practical utility for the UAV task.

revision: partial

Referee: [Target dataset section] The AGSMultiRumex dataset is described only at high level (15 flights over Swiss meadows) without statistics on image resolution, altitude/viewpoint distributions, weed density, or lighting conditions. These details are necessary to evaluate whether dataset characteristics independently favor transformer attention patterns and to assess broader representativeness of the UAV conditions.

Authors: We agree that additional dataset statistics are needed for proper evaluation of representativeness and potential biases. We will expand the Target Dataset section with quantitative details including: mean and range of image resolutions, histograms of flight altitudes and camera viewpoints, average weed density (plants per image) with standard deviation, and qualitative/quantitative notes on lighting variations across the 15 flights. These additions will be supported by the raw metadata from the UAV collection campaign.

revision: yes
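The statistics promised in this response (mean, range, and standard deviation per quantity) are simple to compute once per-image metadata is available. The sketch below uses Python's standard `statistics` module. The plant counts and altitudes are invented placeholders, since the actual AGSMultiRumex metadata is not reproduced here, and the population standard deviation is one possible choice the authors may not share.

```python
from statistics import mean, pstdev


def summarize(values):
    """Mean, range, and population standard deviation of a numeric list."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "std": pstdev(values),
    }


# HYPOTHETICAL metadata for illustration only; not values from the dataset.
plants_per_image = [2, 4, 4, 4, 5, 5, 7, 9]
flight_altitude_m = [10.0, 10.0, 12.0, 12.0]

density_summary = summarize(plants_per_image)
altitude_summary = summarize(flight_altitude_m)
```

Reporting the same summary per flight rather than over the pooled dataset would also show whether individual flights differ strongly in weed density or altitude.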

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation of pretrained models on held-out UAV data


The manuscript reports measured F1 scores from fine-tuning published ViT and ResNet models on a source dataset and evaluating on a custom held-out UAV target dataset (AGSMultiRumex). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance differences are presented as direct experimental outcomes rather than constructed from the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies entirely on standard pretrained models, published domain adaptation algorithms, and conventional supervised fine-tuning without introducing new free parameters, axioms, or postulated entities.



Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

arXiv preprint arXiv:2010.11929

An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Eichhorn, F.C., Kneer, S., Görges, D., 2025. Low-cost automated generation of application maps for control of Rumex Obtusifolius in grasslands. Precision Agriculture 26, 1–22. doi:10.1007/s11119-025-10242-4. Espejo-Garcia, B., Güldenring, R., Na...

work page doi:10.1007/s11119-025-10242-4 2010
[2]

International Journal of Applied Earth Observation and Geoinformation 112, 102864

Mapping of Rumex obtusifolius in nature conservation areas using very high resolution UAV imagery and deep learning. International Journal of Applied Earth Observation and Geoinformation 112, 102864. URL: https://www.sciencedirect.com/science/article/pii/S1569843222000668, doi:10.1016/j.jag.2022.102864. Van Evert, F.K., Polder, G., Va...

work page doi:10.1016/j.jag.2022 2022
[3]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017

doi:https://doi.org/10.1111/j.1365-3180.2008.00682.x, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1365-3180.2008.00682.x. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems 30. Xin, Y., Yang, J., Luo, S., Du,...

work page doi:10.1111/j.1365-3180.2008.00682.x 2008
