arXiv: 2604.23701 · v1 · submitted 2026-04-26 · 💻 cs.CL · cs.AI · cs.CV

Recognition: unknown

Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM-as-a-Judge

Wentao Zhang, Qi Zhang, Mingkun Xu, Mu You, Henghua Shen, Zhongzhi He, Keyan Jin, Derek F. Wong, Tao Fang


Pith reviewed 2026-05-08 06:24 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV

keywords agricultural disease diagnosis · training-free framework · caption generation · LLM as judge · explainable AI · vision-language models · crop pest identification · morphological description


The pith

Refined morphological captions plus an LLM judge raise agricultural disease diagnosis accuracy by over 20 points without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Agri-CPJ, a training-free framework that first has a vision-language model describe the visible morphology of a crop photo in a structured caption. This caption is then refined by checking it against multiple quality dimensions before any diagnosis occurs. Two separate diagnostic answers are generated from different angles, and a separate LLM judge picks the stronger one using agricultural criteria. The authors show that the caption step alone accounts for most of the gain, turning opaque image-to-answer guesses into traceable reasoning that a farmer can inspect. On one benchmark the method adds 22.7 percentage points to disease classification accuracy and 19.5 points to question-answering scores compared with direct prompting.

Core claim

By inserting an intermediate, quality-gated morphological caption between the image and the diagnostic question, and then letting an LLM judge select between two complementary answers, the framework produces both higher accuracy and an explicit audit trail of which visual observations supported the final diagnosis. Ablations show that removing the caption-refinement step consistently lowers performance across tested models, while the full pipeline reaches 77.84 percent on a multiple-choice agricultural benchmark with GPT-5-Nano and remains competitive with larger open models despite the format change.

What carries the argument

The Caption-Prompt-Judge loop: a vision-language model first produces a structured morphological caption that is iteratively refined by multi-dimensional quality gates, after which two candidate diagnoses are generated and an LLM judge selects the better one using domain-specific criteria.
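
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `caption_fn`, `gate_fn`, `answer_fn`, and `judge_fn`, the threshold of 0.8, the cap of three refinement rounds, and the two viewpoint labels are all hypothetical stand-ins for model calls and settings that this summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Diagnosis:
    caption: str           # final morphological caption (the audit trail)
    candidates: List[str]  # the two candidate answers
    choice: int            # 1 or 2, as selected by the judge
    reason: str            # judge rationale


def caption_prompt_judge(
    image: bytes,
    question: str,
    caption_fn: Callable[[bytes, Optional[str]], str],
    gate_fn: Callable[[str], Dict[str, float]],
    answer_fn: Callable[[bytes, str, str, str], str],
    judge_fn: Callable[[str, str, str, str], Tuple[int, str]],
    threshold: float = 0.8,  # hypothetical quality threshold
    max_rounds: int = 3,     # hypothetical cap on refinement rounds
) -> Diagnosis:
    # Stage 1: caption, then refine until every quality dimension passes
    # or the round budget is used up.
    caption = caption_fn(image, None)
    for _ in range(max_rounds):
        scores = gate_fn(caption)
        failed = [dim for dim, score in scores.items() if score < threshold]
        if not failed:
            break
        feedback = "Improve: " + ", ".join(sorted(failed))
        caption = caption_fn(image, feedback)

    # Stage 2: two candidate answers from complementary viewpoints.
    candidates = [
        answer_fn(image, caption, question, view)
        for view in ("disease-focused", "crop-focused")
    ]

    # Stage 3: the judge picks one candidate and explains why.
    choice, reason = judge_fn(caption, question, candidates[0], candidates[1])
    if choice not in (1, 2):
        raise ValueError(f"judge returned invalid choice: {choice!r}")
    return Diagnosis(caption, candidates, choice, reason)
```

Passing the model calls in as plain callables keeps the control flow testable without any API access, and the returned `Diagnosis` holds the caption and judge rationale together, which is the audit trail the summary emphasizes.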

If this is right

Caption refinement is the dominant driver of the reported gains, so any implementation can start by adding only that step.
The generated caption and judge rationale together give a practitioner a concrete list of visual observations to verify or contest.
The method works across different vision-language models without task-specific retraining.
Performance holds when the output format changes from open-ended answers to multiple-choice questions.
No fine-tuning or additional labeled data is required beyond the base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same caption-plus-judge pattern could be tested on other visual identification tasks such as plant nutrient deficiency or livestock health where field photos are common.
Because the framework depends on the quality of the initial caption generator, models with weaker visual grounding may see smaller or negative gains.
The audit trail could be extended to let the judge also flag low-confidence observations so a human can request a second photo.
Cost and latency scale with the number of caption refinements and judge calls, which may limit real-time mobile use on older devices.
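
To make the cost point concrete, here is a rough call-count sketch. The accounting is an assumption, not something the paper reports: it supposes one model call per caption version, one gating call per caption version, one call per candidate answer, and one judge call. The function name `model_calls` is hypothetical.

```python
def model_calls(refinement_rounds: int, candidates: int = 2) -> int:
    """Count model calls for one image under the assumed accounting above."""
    if refinement_rounds < 0 or candidates < 1:
        raise ValueError("refinement_rounds must be >= 0 and candidates >= 1")
    caption_calls = 1 + refinement_rounds  # initial caption plus one per refinement
    gate_calls = 1 + refinement_rounds     # each caption version is scored once
    judge_calls = 1                        # a single selection step
    return caption_calls + gate_calls + candidates + judge_calls
```

Under these assumptions the count grows linearly with the number of refinement rounds, whereas direct prompting needs a single call per image.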

Load-bearing premise

That the multi-dimensional quality checks on the generated captions and the domain-specific rules used by the LLM judge will reliably produce better inputs and selections than asking the model to answer directly from the raw image.

What would settle it

Running the same models on CDDMBench but with the quality-gating step disabled and finding that accuracy drops back to or below the no-caption baseline.
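
The comparison behind this test can be written down as a small check. This is only a sketch: the function names are hypothetical, and the label lists would come from real benchmark runs rather than from anything reported on this page.

```python
from typing import Sequence


def accuracy_pct(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Exact-match accuracy in percent."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def ungated_falls_to_baseline(
    labels: Sequence[str],
    baseline_preds: Sequence[str],
    ungated_preds: Sequence[str],
) -> bool:
    """True when the pipeline without quality gating is no better than
    the no-caption baseline on the same items."""
    return accuracy_pct(ungated_preds, labels) <= accuracy_pct(baseline_preds, labels)
```

Both runs have to use the same model and the same benchmark items for the comparison to say anything about the gating step itself.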

Figures

Figures reproduced from arXiv: 2604.23701 by Derek F. Wong, Henghua Shen, Keyan Jin, Mingkun Xu, Mu You, Qi Zhang, Tao Fang, Wentao Zhang, Zhongzhi He.

**Figure 1.** Comparison between traditional VLM and Agri-CPJ framework on crop species identification. Traditional…

**Figure 2.** Overview of the “Agri-CPJ” pipeline for explainable Agri-Pest VQA, featuring three cohesive stages: …

**Figure 3.** Correlation between caption quality and diagnostic accuracy. The scatter plot shows a positive correlation…

**Figure 4.** Performance trends in the ablation study using Qwen2.5-VL-72B-Instruct with Explanational Captions…

**Figure 5.** Task-specific performance on AgMMU-MCQs across five agricultural tasks. Our full Agri-CPJ with GPT…

Original abstract

Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri-CPJ (Caption-Prompt-Judge), a training-free few-shot framework in which a large vision-language model first generates a structured morphological caption, iteratively refined through multi-dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain-specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT-5-Nano with GPT-5-mini-generated captions yields +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. Evaluated without modification on AgMMU-MCQs, GPT-5-Nano reached 77.84% and Qwen-VL-Chat reached 64.54%, placing them at or above most open-source models of comparable scale despite the format shift from open-ended to multiple-choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available at https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Agri-CPJ adds caption refinement and an LLM judge to VLM diagnosis for better accuracy and traceability on ag images, but the reported gains may hinge on specific prompt choices rather than the full framework.


Agri-CPJ is a straightforward training-free way to boost VLM performance on pest and disease photos by forcing a structured caption step first and then letting an LLM judge pick the best answer from two views. The paper does a few things right. It lays out the full pipeline clearly, shows through ablations that skipping the caption refinement hurts accuracy the most, and releases the code and data. The explainability angle is practical: the caption and judge output give a traceable reason for the diagnosis that a user can check. On the benchmarks they use, the numbers look better than the no-caption baseline, and they test across a couple of models including open ones.

The main concern is whether the gains are robust or tied to the particular choices in the gating dimensions and the judge's domain criteria. The abstract highlights the +22.7 point lift, but without comparisons to other common training-free techniques like direct chain-of-thought on the image or different caption prompts, it's possible the improvement comes more from careful prompt engineering than from the overall CPJ structure. The stress test note raises a fair point here.

This work is for applied researchers and practitioners in agricultural AI who need better accuracy and some transparency without collecting training data. Someone looking for ready-to-use prompting strategies in vision-language tasks for narrow domains would find the details and the repo helpful. The paper shows clear thinking about the failure modes of current models and tries to address them directly with the framework.

I think it should go to peer review. The empirical results and open resources make it worth a closer look even if the central claims need more baseline comparisons to land solidly.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Agri-CPJ, a training-free framework for crop disease and pest diagnosis from field images. A VLM generates structured morphological captions that are iteratively refined via multi-dimensional quality gating; two candidate diagnostic responses are then produced from complementary viewpoints and an LLM judge selects the stronger one using domain-specific criteria. The paper reports that caption refinement has the largest ablation impact, with GPT-5-Nano plus GPT-5-mini captions yielding +22.7 pp classification accuracy and +19.5 points QA score on CDDMBench over no-caption baselines, plus competitive results (77.84% and 64.54%) on AgMMU-MCQs. The caption and judge rationale are presented as an interpretable audit trail, with code and data released publicly.

Significance. If the gains prove robust and generalizable beyond the specific gating dimensions and judge rubric, the work would supply a practical, training-free route to explainable agricultural diagnosis that leverages existing VLMs without fine-tuning. The public code release is a clear strength that supports reproducibility and follow-on work. The significance is currently limited by the absence of head-to-head comparisons against other training-free prompting baselines, which leaves open whether the reported lifts are framework-driven or tied to the particular implementation choices.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: The central claim that the CPJ framework (caption refinement + judge selection) reliably outperforms direct image-to-answer prompting rests on the +22.7 pp gain, yet the manuscript provides no comparison against alternative training-free strategies (e.g., standard chain-of-thought or visual chain-of-thought) using the identical model pair. This omission is load-bearing because the abstract already states that caption refinement is the dominant ablation factor; without the missing baselines it is impossible to attribute the lift to the framework structure rather than the specific multi-dimensional gating criteria or judge prompts.
[Ablation studies] Ablation studies (referenced in abstract): While the paper states that skipping caption refinement consistently degrades accuracy, it does not report the exact multi-dimensional quality gating criteria or the precise domain-specific rubric used by the LLM judge. These details are required to assess whether the reported improvements generalize or are artifacts of the chosen dimensions and rubric, directly affecting the claim of a generally superior training-free approach.

minor comments (2)

[Title / Abstract] Title vs. abstract: The title emphasizes 'Pest Diagnosis' while the abstract and results focus on crop disease classification; a brief clarification of scope (whether pests and diseases are handled identically or separately) would improve precision.
[Methods / Experiments] Model nomenclature: GPT-5-Nano and GPT-5-mini are used without explicit definition or capability assumptions; adding a short footnote or methods paragraph on these models would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate the suggested additions into the revised manuscript to strengthen the attribution of gains to the CPJ framework and improve reproducibility.


Referee: [Abstract / Experiments] Abstract and Experiments section: The central claim that the CPJ framework (caption refinement + judge selection) reliably outperforms direct image-to-answer prompting rests on the +22.7 pp gain, yet the manuscript provides no comparison against alternative training-free strategies (e.g., standard chain-of-thought or visual chain-of-thought) using the identical model pair. This omission is load-bearing because the abstract already states that caption refinement is the dominant ablation factor; without the missing baselines it is impossible to attribute the lift to the framework structure rather than the specific multi-dimensional gating criteria or judge prompts.

Authors: We acknowledge that additional head-to-head comparisons with other training-free prompting methods would more clearly isolate the contribution of the caption refinement and judge selection steps. Our no-caption baseline already represents standard direct image-to-answer prompting, and the ablations show caption refinement as the largest single factor. To address the concern, the revised manuscript will include chain-of-thought and visual chain-of-thought baselines using the same model pairs (GPT-5-Nano with GPT-5-mini, and the Qwen-VL-Chat pair) so that readers can directly compare the framework structure against these alternatives. revision: yes
Referee: [Ablation studies] Ablation studies (referenced in abstract): While the paper states that skipping caption refinement consistently degrades accuracy, it does not report the exact multi-dimensional quality gating criteria or the precise domain-specific rubric used by the LLM judge. These details are required to assess whether the reported improvements generalize or are artifacts of the chosen dimensions and rubric, directly affecting the claim of a generally superior training-free approach.

Authors: We agree that explicit reporting of the gating dimensions and judge rubric is necessary for assessing generalizability and reproducibility. These details are present in the public code repository and supplementary materials, but we will expand the main text of the revised manuscript to describe the exact multi-dimensional quality gating criteria (including the specific dimensions, quality thresholds, and iteration rules) and the full domain-specific rubric employed by the LLM judge. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with explicit baselines


The paper presents Agri-CPJ as a training-free prompting framework that chains existing pre-trained VLMs for caption generation (with multi-dimensional gating) and LLMs for judge-based selection, then validates it via direct ablations and comparisons to no-caption baselines on CDDMBench and AgMMU-MCQs. No equations, derivations, or self-referential definitions appear; reported gains (+22.7 pp classification, +19.5 QA) are framed as measured outcomes rather than constructed equivalences. No load-bearing self-citations or ansatz smuggling are present in the provided text. The derivation chain is therefore self-contained and externally falsifiable through the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about the reliability of off-the-shelf VLMs for morphological captioning and LLMs for domain-specific judging; no new free parameters or invented entities are introduced.

axioms (2)

domain assumption: Large vision-language models can produce accurate, structured morphological captions from field photographs when given appropriate prompts and quality gates.
This is required for the first stage of the pipeline to supply useful input to the judge.
domain assumption: An LLM can select the stronger of two diagnostic answers using domain-specific criteria without task-specific training.
This underpins the final selection step and the claim of improved accuracy.

pith-pipeline@v0.9.0 · 5612 in / 1504 out tokens · 89226 ms · 2026-05-08T06:24:10.842842+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis
cs.MA 2026-05 unverdicted novelty 6.0

A new 839K-image plant disease dataset paired with an agentic visual reasoning system that uses source-grounded symptoms raises diagnosis accuracy by 16.2 points on average and generalizes to unseen crops without retraining.

Reference graph

Works this paper leans on

34 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1]

Leveraging vision language models for specialized agricultural tasks, in: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE. pp. 6320–6329. Badgujar, C.M., Poulose, A., Gan, H., 2024. Agricultural object detection with you only look once (yolo) algorithm: A bibliometric and systematic literature review. Computers and Electronics in Agriculture 223, 109090. Bai...

work page arXiv 2024
[2]

arXiv preprint arXiv:2010.11929

An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 . Duro, D.C., Franklin, S.E., Dubé, M.G., 2012. A comparison of pixel-based and object-based image analysis with selected machine learning algorithms for the classification of agricultural landscapes using spot-5 hrg imagery. Remote sensing of envir...

work page doi:10.1109/icassp55912.2026.11464859 2010
[3]

Focus on describing the plant’s morphology, color, and overall condition
[4]

If disease symptoms are present, describe their appearance: color, shape, distribution, size, quantity, and extent
[5]

If no disease is visible, state that the plant appears healthy
[6]

Assess the severity and stage of any symptoms based on visual cues
[7]

Keep the description concise (90-100 words)
[8]

unable to describe clearly

If uncertain about features, indicate "unable to describe clearly" or "need more images". ## Output Format {"image_caption": "Description of the plant’s visual features and any disease symptoms, including morphology, color, distribution, size, and condition, without naming the plant or disease."} Three few-shot exemplars accompany the prompt, covering fungal pust...
[9]

Base answers solely on image, caption, and question
[10]

Prioritize scientific accuracy
[11]

Never return empty answers
[12]

BOTH answers must include BOTH plant type and disease type
[13]

Answer1: Focus on PEST/DISEASE identification (symptoms, severity, features)
[14]

Answer2: Focus on CROP identification (type, variety, morphology)
[15]

Is this crop diseased?

Both answers should be scientifically accurate and detailed. The human-turn template is: Background(image_caption): {image_caption}\nQuestion: {question}. Five few-shot examples are included; one representative case (apple Alternaria blotch, question: “Is this crop diseased?”) yields: Answer1 identifying Alternaria Blotch with circular brown lesions (2–5 mm) and yellowish h...
[16]

Extract disease context from background Q&A and question
[17]

Diagnose and describe disease mechanisms, signs/symptoms, and disease cycles; include differential diagnosis
[18]

Translate product names to active ingredients; handle dilutions and rates precisely (metric units)
[19]

Give practical, stage-specific recommendations (seed/ seedling, vegetative, reproductive), including timings, intervals, and number of applications
[20]

Integrate IPM: resistant varieties, sanitation, crop rotation, canopy management, balanced fertilization (N-P-K-Si), irrigation and drainage, environmental control (temp/RH/ventilation) [Anonymous: Preprint submitted to Elsevier, page 20 of 24]
[21]

Address resistance management: rotate modes of action (e.g., FRAC for fungicides, IRAC for insecticides)
[22]

## Rules (abridged)

Ensure scientific accuracy, safety, and applicability. ## Rules (abridged)
[23]

Base answers on provided background; supplement only with widely accepted general agronomic knowledge if needed
[24]

If specific data (dose, interval, temperature threshold) are provided in the background, use them verbatim
[25]

Include active ingredient and formulation when available (e.g., 20% tebuconazole EC); provide dilution ratios, spray volume, timing, interval, and number of sprays
[26]

For cultural practices, include actionable details: balanced N-P-K, silicon/potash, drainage, plant spacing, pruning for airflow, temperature/RH targets
[27]

Rotate fungicides with different FRAC codes; avoid more than 2 consecutive applications of the same MoA
[28]

Answer2: Disease explanation (symptoms, causes/etiology, disease cycle/epidemiology, conducive conditions)

Provide TWO answers: Answer1: Treatment, prevention, and control (step-by-step IPM with specific methods, timings, dosages, intervals). Answer2: Disease explanation (symptoms, causes/etiology, disease cycle/epidemiology, conducive conditions)
[29]

Five few-shot examples are included

Advise PPE, follow local labels, observe pre-harvest intervals (PHI) and re-entry intervals (REI). Five few-shot examples are included. Table 5 shows one representative case: Table 5: Representative few-shot example for knowledge QA (Stage 2). Input Output Caption:Plant: Wheat; Disease: Leaf Rust... Question:What control techniques are applicable to Wheat...
[30]

Accuracy of Plant Identification: Correct identification of crop species
[31]

Accuracy of Disease/Pest Identification: Correct identification of disease or pest
[32]

Symptom Description Accuracy: Precise description of disease symptoms
[33]

Adherence to Required Format: Proper structure with plant and disease identification
[34]

choice": 1 or 2,

Completeness and Professionalism: Comprehensive and scientifically sound response ## Task: Compare Answer 1 and Answer 2 based on the above criteria and select the better one. ## Output Format: { "choice": 1 or 2, "reason": "Brief explanation for your choice", "scores": { "answer1": { "plant_accuracy": 0-1, "disease_accuracy": 0-1, "symptom_accuracy": 0-1...
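
The judge output format quoted in the last fragment can be checked mechanically before a selection is trusted. The sketch below is an assumption-laden illustration, not code from the paper's repository: the function name `parse_judge_output` is hypothetical, and because the extracted format is truncated, it validates whatever score dimensions are present instead of assuming a fixed set of keys.

```python
import json
from typing import Any, Dict


def parse_judge_output(raw: str) -> Dict[str, Any]:
    """Parse and validate a judge reply shaped like the quoted output format."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    # "choice" must be the integer 1 or 2 (booleans are rejected explicitly,
    # because True == 1 in Python).
    choice = data.get("choice")
    if isinstance(choice, bool) or choice not in (1, 2):
        raise ValueError("choice must be 1 or 2")

    reason = data.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")

    # Every score that is present must be a number in [0, 1].
    scores = data.get("scores", {})
    if not isinstance(scores, dict):
        raise ValueError("scores must be an object")
    for answer, dims in scores.items():
        if not isinstance(dims, dict):
            raise ValueError(f"scores.{answer} must be an object")
        for dim, value in dims.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not 0 <= value <= 1:
                raise ValueError(f"{answer}.{dim} must be a number in [0, 1]")
    return data
```

Rejecting malformed replies at this point keeps an invalid `choice` or an out-of-range score from silently entering the audit trail.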