AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

Alessandro Suglia; George Pantazopoulos; Hamza Mooraj

arxiv: 2603.13354 · v3 · submitted 2026-03-08 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-15 14:34 UTC · model grok-4.3

keywords: crop disease classification · domain shift · CNN · vision-language models · benchmark · agricultural AI · generative models · contrastive learning

The pith

Crop disease models show distinct domain-shift behaviors: CNNs peak in labs but drop in fields, while VLMs maintain more stable cross-condition performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares convolutional networks, contrastive vision-language models, and generative vision-language models on fine-grained disease identification across 16 crops. It introduces a 111k-image benchmark that cleanly separates laboratory and field imagery to measure how each family handles changes in acquisition conditions. Under identical training and evaluation protocols, CNNs reach top accuracy when test images match training conditions yet lose the most ground under shift; contrastive VLMs deliver competitive cross-domain results with lower parameter counts; generative VLMs resist distributional change best but introduce new failure modes from unparsable text outputs. These profiles matter because field deployment rarely matches the controlled conditions used in most existing crop-disease studies.

Core claim

The results reveal distinct performance profiles: CNNs achieve the highest accuracy on in-domain imagery but exhibit pronounced degradation under domain shift; contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance; generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation.

What carries the argument

The AgriPath-LF16 benchmark of 111k images spanning 16 crops and 41 diseases with explicit laboratory-versus-field separation, used to run all three model families under unified full, lab-only, and field-only training regimes.

If this is right

Deployment context, not aggregate accuracy, should drive architectural choice for crop disease systems.
Contrastive VLMs supply a parameter-efficient route to acceptable cross-domain reliability.
Generative VLMs need extra safeguards for output parsability to realize their domain-shift advantage.
Mixed lab-and-field training regimes affect generalization differently across the three families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future agricultural benchmarks would gain predictive power by routinely adding explicit domain-split controls.
If generation reliability improves, the resilience of generative VLMs could support rapid adaptation to unseen crop varieties.
Resource-limited field deployments may prefer contrastive VLMs when labeled field data for new diseases remains scarce.

Load-bearing premise

The explicit lab-versus-field separation in the 111k-image benchmark and the unified training protocols produce a fair, representative test of domain shift that generalizes beyond the 16 crops and 41 diseases studied.

What would settle it

Retraining and retesting the same three model families on a new collection of crops or geographies that shows CNNs retaining high accuracy across lab-to-field shifts would falsify the claimed distinct degradation profiles.

Original abstract

Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark of 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardised training and evaluation. We train and evaluate all models under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability (i.e., output parsability measured via PSR). The results reveal distinct performance profiles: CNNs achieve the highest accuracy on in-domain imagery but exhibit pronounced degradation under domain shift; contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance; generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate performance alone.
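
The abstract names macro-F1 as the headline metric. As a reminder of what that number measures, here is a minimal pure-Python sketch: the unweighted mean of per-class F1, so rare diseases count as much as common ones. The function name and the choice to score only classes present in the ground truth are assumptions for illustration, not the authors' evaluation code. A prediction of `None` (for example, an unparsable generative output) simply counts as a miss for the true class.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores (illustrative sketch)."""
    if labels is None:
        # Assumption: score only the classes that occur in the ground truth.
        labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, a model that ignores a handful of rare diseases is penalised more under macro-F1 than under plain accuracy.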

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgriPath-LF16 gives a new benchmark with lab-field splits and a clean head-to-head of CNNs versus two VLM types, showing different robustness patterns, though the domain claims rest on an unverified assumption about class balance.

The main thing to know is that this paper introduces AgriPath-LF16, a 111k-image dataset across 16 crops and 41 diseases with explicit lab versus field labels, plus the first matched comparison of CNNs, contrastive VLMs, and generative VLMs on the same crop-disease task. They train everything under unified protocols on a balanced 30k subset and report macro-F1 along with parse success rate for the generative models. This setup produces clear profiles: CNNs lead on in-domain data but drop under shift, contrastive VLMs hold steady with lower parameter cost, and generative VLMs handle variation best yet add output-parsing failures. For anyone choosing models for field deployment in agriculture, those distinctions are directly usable.

The work is a straightforward empirical comparison rather than new theory, and the benchmark construction itself is the real addition. The soft spot is the one the stress-test note flags. Nothing in the abstract shows that disease frequencies were matched or even checked between the lab and field portions. If certain diseases appear mostly in one setting, the CNN degradation could come from label shift instead of image domain shift, which would change how much weight to put on the resilience claims. Error bars, exact split details, and per-crop breakdowns would tighten this up.

The paper is aimed at people working on applied computer vision for crops who already know the lab-to-field gap is a real barrier. A reader in precision agriculture or domain adaptation for ag tasks would get concrete guidance from the results and the dataset. It has enough new data and controlled comparisons to deserve referee time, even if the distribution checks need to be added or clarified in revision. I would send it for peer review rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces AgriPath-LF16, a 111k-image benchmark spanning 16 crops and 41 diseases with explicit lab-versus-field separation, plus a balanced 30k training subset. It performs a controlled empirical comparison of CNNs, contrastive VLMs, and generative VLMs for fine-grained crop disease classification under unified training protocols (full, lab-only, field-only regimes), reporting macro-F1 and Parse Success Rate (PSR). The central claim is that the three families exhibit distinct profiles: CNNs achieve highest in-domain accuracy but degrade sharply under domain shift; contrastive VLMs are robust and parameter-efficient with competitive cross-domain results; generative VLMs show strongest resilience to distributional variation, albeit with free-text generation failures.

Significance. If the domain-shift isolation holds after verification, the work supplies a useful empirical map of architectural trade-offs for agricultural vision systems that must operate across acquisition conditions. The scale of the new benchmark and the unified-protocol design are strengths that could inform deployment decisions beyond aggregate accuracy.

major comments (1)

[Benchmark construction (Abstract and §3)] The claim that the lab-field split isolates acquisition-domain effects (and thereby explains CNN degradation) is load-bearing for the distinct-profile conclusion. The abstract states only that a 'balanced 30k subset' is used for standardised training; it provides no evidence or table showing that per-disease class frequencies are matched between the lab and field partitions. If label distributions differ systematically (e.g., certain diseases appear predominantly in one setting), the observed cross-domain drop could arise from label shift rather than imagery domain shift, weakening the generalization argument.
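
One way to make this check concrete is to compare the per-disease label distributions of the two partitions directly. The sketch below computes the total variation distance between them: 0 means identical class frequencies, 1 means the partitions share no diseases. The function names and the choice of total variation as the summary statistic are assumptions for illustration; the paper does not report such a check.

```python
from collections import Counter


def label_distribution(labels):
    """Relative frequency of each disease label in one partition."""
    if not labels:
        raise ValueError("partition has no labels")
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def total_variation(lab_labels, field_labels):
    """Total variation distance between lab and field label distributions."""
    p = label_distribution(lab_labels)
    q = label_distribution(field_labels)
    classes = set(p) | set(q)  # include diseases seen in only one partition
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)
```

A value near zero would support reading the cross-domain drop as imagery domain shift; a large value would leave label shift as a live alternative explanation.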

minor comments (1)

[Abstract] The Parse Success Rate (PSR) metric for generative VLMs is referenced without an explicit definition or formula in the abstract; a short parenthetical or pointer to its computation (e.g., fraction of outputs that parse to valid disease labels) would improve immediate readability.
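
Following the parenthetical suggested above, PSR can be read as the fraction of generated outputs that parse to a valid disease label. The sketch below is one hypothetical instantiation, not the authors' parser: it assumes labels are lowercase words separated by single spaces, and it counts an output as parsed only when exactly one valid label appears in it. Both the matching rule and the function names are illustrative assumptions.

```python
import re


def parse_label(text, valid_labels):
    """Return the single valid label mentioned in text, or None if unparsable."""
    # Lowercase and collapse punctuation so "Early Blight." matches "early blight".
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    hits = [
        lab for lab in valid_labels
        if re.search(rf"\b{re.escape(lab)}\b", normalized)
    ]
    # Assumption: zero or multiple label mentions both count as parse failures.
    return hits[0] if len(hits) == 1 else None


def parse_success_rate(outputs, valid_labels):
    """Fraction of outputs that parse to a valid label, plus the parsed labels."""
    if not outputs:
        raise ValueError("no outputs to score")
    parsed = [parse_label(o, valid_labels) for o in outputs]
    psr = sum(p is not None for p in parsed) / len(parsed)
    return psr, parsed
```

Under this rule, label sets where one name contains another (say, a generic "blight" alongside "early blight") would need a longest-match tie-break, which the sketch deliberately leaves out.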

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The concern regarding potential label shift in the lab-field partitions is well-taken and directly impacts the strength of our domain-shift isolation claim. We address it below and commit to revisions that strengthen the empirical foundation of the work.

Referee: Benchmark construction (Abstract and §3): The claim that the lab-field split isolates acquisition-domain effects (and thereby explains CNN degradation) is load-bearing for the distinct-profile conclusion. The abstract states only that a 'balanced 30k subset' is used for standardised training; it provides no evidence or table showing that per-disease class frequencies are matched between the lab and field partitions. If label distributions differ systematically (e.g., certain diseases appear predominantly in one setting), the observed cross-domain drop could arise from label shift rather than imagery domain shift, weakening the generalization argument.

Authors: We agree that explicit verification of class frequencies is necessary to substantiate the isolation of acquisition-domain effects. The 30k training subset was constructed by stratified sampling to ensure class balance within the training regime, but the original manuscript did not include a table comparing per-disease image counts across the full lab and field partitions. In the revised version we will add a supplementary table (and corresponding discussion in §3) that reports the exact number of images per disease in each partition. This will allow readers to quantify any label shift. If notable imbalances are present, we will additionally report cross-domain results on a re-sampled test set that enforces class balance, thereby providing a clearer separation between label shift and imagery domain shift.

revision: yes
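
The re-sampled, class-balanced test set promised in this response could be built along the following lines. This is a minimal sketch under stated assumptions: samples are `(item, label)` pairs with string labels, every class is downsampled to the size of the smallest class (optionally capped by `per_class`), and the seed is fixed for reproducibility. The function and parameter names are hypothetical, and the authors' actual procedure may differ.

```python
import random
from collections import defaultdict


def balanced_resample(samples, per_class=None, seed=0):
    """Downsample (item, label) pairs so every label has the same count."""
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append((item, label))
    if not by_class:
        return []
    # Target size: the smallest class, optionally capped by per_class.
    n = min(len(group) for group in by_class.values())
    if per_class is not None:
        n = min(n, per_class)
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], n))
    return balanced
```

To compare lab and field results on equal footing, the same routine would presumably be applied to each partition after restricting both to the diseases they share, so that neither label set nor label frequency differs between the two test sets.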

Circularity Check

0 steps flagged

No circularity: empirical model comparison on held-out benchmark data

The paper introduces the AgriPath-LF16 benchmark and performs a controlled empirical comparison of CNNs, contrastive VLMs, and generative VLMs under unified training protocols on full, lab-only, and field-only regimes. Performance claims rest on macro-F1 and Parse Success Rate measured on held-out test sets; no equations, derivations, or first-principles results appear that reduce any reported outcome to fitted parameters or self-citations by construction. The central findings are therefore independent of internal definitions and rely on external data splits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the newly introduced benchmark faithfully captures real domain shift and that unified training protocols eliminate confounding variables across model families.

axioms (1)

domain assumption: Standard machine-learning evaluation assumptions hold, including that train/test splits within lab and field subsets are representative and that macro-F1 and PSR are appropriate metrics for the task.
Implicit in the description of unified protocols and evaluation on full, lab-only, and field-only regimes.

invented entities (1)

AgriPath-LF16 benchmark (no independent evidence)
purpose: Enable controlled analysis of domain effects by providing 111k images with explicit lab/field separation across 16 crops and 41 diseases.
Newly created dataset introduced specifically for this study.

pith-pipeline@v0.9.0 · 5528 in / 1379 out tokens · 38407 ms · 2026-05-15T14:34:37.544713+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The results reveal distinct performance profiles: CNNs achieve the highest accuracy on in-domain imagery but exhibit pronounced degradation under domain shift; contrastive VLMs provide a robust and parameter-efficient alternative..."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "AgriPath-LF16, a benchmark of 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.