Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM-as-a-Judge
Pith reviewed 2026-05-08 06:24 UTC · model grok-4.3
The pith
Refined morphological captions plus an LLM judge raise agricultural disease diagnosis accuracy by over 20 points without any training or fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By inserting an intermediate, quality-gated morphological caption between the image and the diagnostic question, and then letting an LLM judge select between two complementary answers, the framework produces both higher accuracy and an explicit audit trail of which visual observations supported the final diagnosis. Ablations show that removing the caption-refinement step consistently lowers performance across tested models, while the full pipeline reaches 77.84 percent on a multiple-choice agricultural benchmark with GPT-5-Nano and remains competitive with larger open models despite the format change.
What carries the argument
The Caption-Prompt-Judge loop: a vision-language model first produces a structured morphological caption that is iteratively refined by multi-dimensional quality gates, after which two candidate diagnoses are generated and an LLM judge selects the better one using domain-specific criteria.
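The control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: every callable (`caption_fn`, `gate_fn`, `refine_fn`, `answer_fn`, `judge_fn`), the `Diagnosis` container, and the `max_refinements` budget are hypothetical names introduced here, and the model calls are injected so that no particular vendor API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Diagnosis:
    caption: str                  # refined morphological caption (audit trail, part 1)
    candidates: Tuple[str, str]   # the two complementary answers
    choice: int                   # 1 or 2, as returned by the judge
    reason: str                   # judge rationale (audit trail, part 2)

    @property
    def answer(self) -> str:
        return self.candidates[self.choice - 1]


def run_cpj(
    image: bytes,
    question: str,
    caption_fn: Callable[[bytes], str],
    gate_fn: Callable[[str], List[str]],                  # names of failed quality checks
    refine_fn: Callable[[bytes, str, List[str]], str],
    answer_fn: Callable[[bytes, str, str], Tuple[str, str]],
    judge_fn: Callable[[str, str, str, str], Tuple[int, str]],
    max_refinements: int = 3,                             # hypothetical budget
) -> Diagnosis:
    """Caption -> gate/refine loop -> two candidate answers -> judge selection."""
    caption = caption_fn(image)
    for _ in range(max_refinements):
        failed = gate_fn(caption)
        if not failed:
            break  # caption passed every quality check
        caption = refine_fn(image, caption, failed)

    candidates = answer_fn(image, caption, question)
    choice, reason = judge_fn(caption, question, candidates[0], candidates[1])
    if choice not in (1, 2):
        raise ValueError(f"judge returned an invalid choice: {choice!r}")
    return Diagnosis(caption, candidates, choice, reason)
```

Returning the caption and the judge's reason alongside the answer is what makes the audit trail possible: a practitioner can inspect `caption` and `reason` without rerunning any model.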
If this is right
- Caption refinement is the dominant driver of the reported gains, so any implementation can start by adding only that step.
- The generated caption and judge rationale together give a practitioner a concrete list of visual observations to verify or contest.
- The method works across different vision-language models without task-specific retraining.
- Performance holds when the output format changes from open-ended answers to multiple-choice questions.
- No fine-tuning or additional labeled data is required beyond the base models.
Where Pith is reading between the lines
- The same caption-plus-judge pattern could be tested on other visual identification tasks such as plant nutrient deficiency or livestock health where field photos are common.
- Because the framework depends on the quality of the initial caption generator, models with weaker visual grounding may see smaller or negative gains.
- The audit trail could be extended to let the judge also flag low-confidence observations so a human can request a second photo.
- Cost and latency scale with the number of caption refinements and judge calls, which may limit real-time mobile use on older devices.
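To make the cost point concrete, a rough upper bound on model calls per image can be written down. This is a back-of-the-envelope sketch under stated assumptions, not a figure from the paper: it assumes one initial caption call, one gate check before each refinement, and one call each for answer generation and judging unless the caller overrides those defaults. The function name and parameters are hypothetical.

```python
def max_model_calls(
    max_refinements: int,
    gate_uses_model: bool = True,   # assumption: each gate check is itself a model call
    answer_calls: int = 1,          # assumption: both candidates come from one call
    judge_calls: int = 1,
) -> int:
    """Worst-case number of model calls for one image under the assumptions above."""
    if max_refinements < 0:
        raise ValueError("max_refinements must be non-negative")
    caption_calls = 1 + max_refinements            # initial caption plus each rewrite
    gate_calls = max_refinements if gate_uses_model else 0
    return caption_calls + gate_calls + answer_calls + judge_calls
```

Under these assumptions the bound grows linearly in the refinement budget, so capping the number of refinements is the most direct lever on latency.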
Load-bearing premise
That the multi-dimensional quality checks on the generated captions and the domain-specific rules used by the LLM judge will reliably produce better inputs and selections than asking the model to answer directly from the raw image.
What would settle it
Running the same models on CDDMBench but with the quality-gating step disabled and finding that accuracy drops back to or below the no-caption baseline.
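The test described above reduces to comparing three accuracy figures. The sketch below assumes per-item correctness flags are already available for each condition; the function names and the result keys are hypothetical, and any numbers passed in are the caller's own, not results from the paper.

```python
from typing import Dict, Sequence, Union


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    if not correct:
        raise ValueError("need at least one item")
    return sum(correct) / len(correct)


def gating_ablation(
    baseline: Sequence[bool],   # no caption: direct image-to-answer prompting
    ungated: Sequence[bool],    # caption used, quality gating disabled
    full: Sequence[bool],       # caption with gating and refinement
) -> Dict[str, Union[float, bool]]:
    """Summarise the ablation and flag the outcome that would undercut the premise."""
    acc_base, acc_ungated, acc_full = accuracy(baseline), accuracy(ungated), accuracy(full)
    return {
        "baseline": acc_base,
        "ungated": acc_ungated,
        "full": acc_full,
        "gain_full_pp": 100 * (acc_full - acc_base),
        "gain_ungated_pp": 100 * (acc_ungated - acc_base),
        # True when disabling the gate returns accuracy to the no-caption level or lower.
        "ungated_at_or_below_baseline": acc_ungated <= acc_base,
    }
```

A real comparison would also need the same items, prompts, and decoding settings in all three conditions, plus some estimate of run-to-run variance before reading much into small differences.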
Original abstract
Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri-CPJ (Caption-Prompt-Judge), a training-free few-shot framework in which a large vision-language model first generates a structured morphological caption, iteratively refined through multi-dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain-specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT-5-Nano with GPT-5-mini-generated captions yields +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. Evaluated without modification on AgMMU-MCQs, GPT-5-Nano reached 77.84% and Qwen-VL-Chat reached 64.54%, placing them at or above most open-source models of comparable scale despite the format shift from open-ended to multiple-choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available at https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Agri-CPJ, a training-free framework for crop disease and pest diagnosis from field images. A VLM generates structured morphological captions that are iteratively refined via multi-dimensional quality gating; two candidate diagnostic responses are then produced from complementary viewpoints and an LLM judge selects the stronger one using domain-specific criteria. The paper reports that caption refinement has the largest ablation impact, with GPT-5-Nano plus GPT-5-mini captions yielding +22.7 pp classification accuracy and +19.5 points QA score on CDDMBench over no-caption baselines, plus competitive results (77.84% and 64.54%) on AgMMU-MCQs. The caption and judge rationale are presented as an interpretable audit trail, with code and data released publicly.
Significance. If the gains prove robust and generalizable beyond the specific gating dimensions and judge rubric, the work would supply a practical, training-free route to explainable agricultural diagnosis that leverages existing VLMs without fine-tuning. The public code release is a clear strength that supports reproducibility and follow-on work. The significance is currently limited by the absence of head-to-head comparisons against other training-free prompting baselines, which leaves open whether the reported lifts are framework-driven or tied to the particular implementation choices.
major comments (2)
- [Abstract / Experiments] Abstract and Experiments section: The central claim that the CPJ framework (caption refinement + judge selection) reliably outperforms direct image-to-answer prompting rests on the +22.7 pp gain, yet the manuscript provides no comparison against alternative training-free strategies (e.g., standard chain-of-thought or visual chain-of-thought) using the identical model pair. This omission is load-bearing because the abstract already states that caption refinement is the dominant ablation factor; without the missing baselines it is impossible to attribute the lift to the framework structure rather than the specific multi-dimensional gating criteria or judge prompts.
- [Ablation studies] Ablation studies (referenced in abstract): While the paper states that skipping caption refinement consistently degrades accuracy, it does not report the exact multi-dimensional quality gating criteria or the precise domain-specific rubric used by the LLM judge. These details are required to assess whether the reported improvements generalize or are artifacts of the chosen dimensions and rubric, directly affecting the claim of a generally superior training-free approach.
minor comments (2)
- [Title / Abstract] Title vs. abstract: The title emphasizes 'Pest Diagnosis' while the abstract and results focus on crop disease classification; a brief clarification of scope (whether pests and diseases are handled identically or separately) would improve precision.
- [Methods / Experiments] Model nomenclature: GPT-5-Nano and GPT-5-mini are used without explicit definition or capability assumptions; adding a short footnote or methods paragraph on these models would aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will incorporate the suggested additions into the revised manuscript to strengthen the attribution of gains to the CPJ framework and improve reproducibility.
-
Referee: [Abstract / Experiments] Abstract and Experiments section: The central claim that the CPJ framework (caption refinement + judge selection) reliably outperforms direct image-to-answer prompting rests on the +22.7 pp gain, yet the manuscript provides no comparison against alternative training-free strategies (e.g., standard chain-of-thought or visual chain-of-thought) using the identical model pair. This omission is load-bearing because the abstract already states that caption refinement is the dominant ablation factor; without the missing baselines it is impossible to attribute the lift to the framework structure rather than the specific multi-dimensional gating criteria or judge prompts.
Authors: We acknowledge that additional head-to-head comparisons with other training-free prompting methods would more clearly isolate the contribution of the caption refinement and judge selection steps. Our no-caption baseline already represents standard direct image-to-answer prompting, and the ablations show caption refinement as the largest single factor. To address the concern, the revised manuscript will include chain-of-thought and visual chain-of-thought baselines using the same model pairs (GPT-5-Nano with GPT-5-mini, and the Qwen-VL-Chat pair) so that readers can directly compare the framework structure against these alternatives. revision: yes
-
Referee: [Ablation studies] Ablation studies (referenced in abstract): While the paper states that skipping caption refinement consistently degrades accuracy, it does not report the exact multi-dimensional quality gating criteria or the precise domain-specific rubric used by the LLM judge. These details are required to assess whether the reported improvements generalize or are artifacts of the chosen dimensions and rubric, directly affecting the claim of a generally superior training-free approach.
Authors: We agree that explicit reporting of the gating dimensions and judge rubric is necessary for assessing generalizability and reproducibility. These details are present in the public code repository and supplementary materials, but we will expand the main text of the revised manuscript to describe the exact multi-dimensional quality gating criteria (including the specific dimensions, quality thresholds, and iteration rules) and the full domain-specific rubric employed by the LLM judge. revision: yes
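Because the exact gating dimensions and thresholds are not given in this review, the following is only a sketch of what a multi-dimensional gate could look like. The dimension names and threshold values in `DEFAULT_THRESHOLDS` are hypothetical placeholders; the only grounded value is the 90-100 word range, which comes from the caption prompt quoted in the reference graph on this page. Scores are assumed to be supplied by some external scorer (for example an LLM) on a 0-1 scale.

```python
from typing import Dict, List, Mapping, Optional, Tuple

# Hypothetical dimensions and thresholds; the paper's actual values are not stated here.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "morphology_coverage": 0.7,
    "symptom_specificity": 0.7,
    "label_neutrality": 1.0,   # caption should not name the plant or disease
}
WORD_RANGE: Tuple[int, int] = (90, 100)   # from the caption prompt: "concise (90-100 words)"


def gate_caption(
    caption: str,
    scores: Mapping[str, float],
    thresholds: Optional[Mapping[str, float]] = None,
    word_range: Tuple[int, int] = WORD_RANGE,
) -> List[str]:
    """Return the names of the checks the caption fails; an empty list means it passes."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    missing = sorted(set(thresholds) - set(scores))
    if missing:
        raise KeyError(f"missing scores for: {missing}")

    failed = [name for name, minimum in thresholds.items() if scores[name] < minimum]

    # Deterministic length check on whitespace-separated words.
    n_words = len(caption.split())
    if not word_range[0] <= n_words <= word_range[1]:
        failed.append("length")
    return failed
```

Returning the failed dimension names, rather than a single pass/fail flag, lets a refinement step target the specific weakness, and it is also the kind of detail the referee asks to see reported.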
Circularity Check
No circularity: empirical framework with explicit baselines
The paper presents Agri-CPJ as a training-free prompting framework that chains existing pre-trained VLMs for caption generation (with multi-dimensional gating) and LLMs for judge-based selection, then validates it via direct ablations and comparisons to no-caption baselines on CDDMBench and AgMMU-MCQs. No equations, derivations, or self-referential definitions appear; reported gains (+22.7 pp classification, +19.5 QA) are framed as measured outcomes rather than constructed equivalences. No load-bearing self-citations or ansatz smuggling are present in the provided text. The derivation chain is therefore self-contained and externally falsifiable through the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Large vision-language models can produce accurate, structured morphological captions from field photographs when given appropriate prompts and quality gates.
- domain assumption: An LLM can select the stronger of two diagnostic answers using domain-specific criteria without task-specific training.
Forward citations
Cited by 1 Pith paper
-
SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis
A new 839K-image plant disease dataset paired with an agentic visual reasoning system that uses source-grounded symptoms raises diagnosis accuracy by 16.2 points on average and generalizes to unseen crops without retraining.
Reference graph
Works this paper leans on
-
[1]
Leveraging vision language models for specialized agricultural tasks, in: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE. pp. 6320–6329. Badgujar, C.M., Poulose, A., Gan, H., 2024. Agricultural object detection with you only look once (YOLO) algorithm: A bibliometric and systematic literature review. Computers and Electronics in Agriculture 223, 109090. Bai...
-
[2]
arXiv preprint arXiv:2010.11929
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Duro, D.C., Franklin, S.E., Dubé, M.G., 2012. A comparison of pixel-based and object-based image analysis with selected machine learning algorithms for the classification of agricultural landscapes using SPOT-5 HRG imagery. Remote sensing of envir...
-
[3]
Focus on describing the plant’s morphology, color, and overall condition
-
[4]
If disease symptoms are present, describe their appearance: color, shape, distribution, size, quantity, and extent
-
[5]
If no disease is visible, state that the plant appears healthy
-
[6]
Assess the severity and stage of any symptoms based on visual cues
-
[7]
Keep the description concise (90-100 words)
-
[8]
unable to describe clearly
If uncertain about features, indicate "unable to describe clearly" or "need more images". ## Output Format {"image_caption": "Description of the plant’s visual features and any disease symptoms, including morphology, color, distribution, size, and condition, without naming the plant or disease."} Three few-shot exemplars accompany the prompt, covering fungal pust...
-
[9]
Base answers solely on image, caption, and question
-
[10]
Prioritize scientific accuracy
-
[11]
Never return empty answers
-
[12]
BOTH answers must include BOTH plant type and disease type
-
[13]
Answer1: Focus on PEST/DISEASE identification (symptoms, severity, features)
-
[14]
Answer2: Focus on CROP identification (type, variety, morphology)
-
[15]
Is this crop diseased?
Both answers should be scientifically accurate and detailed. The human-turn template is: Background(image_caption): {image_caption}\nQuestion: {question}. Five few-shot examples are included; one representative case (apple Alternaria blotch, question: “Is this crop diseased?”) yields: Answer1 identifying Alternaria Blotch with circular brown lesions (2–5 mm) and yellowish h...
-
[16]
Extract disease context from background Q&A and question
-
[17]
Diagnose and describe disease mechanisms, signs/symptoms, and disease cycles; include differential diagnosis
-
[18]
Translate product names to active ingredients; handle dilutions and rates precisely (metric units)
-
[19]
Give practical, stage-specific recommendations (seed/ seedling, vegetative, reproductive), including timings, intervals, and number of applications
-
[20]
Integrate IPM: resistant varieties, sanitation, crop rotation, canopy management, balanced fertilization (N-P-K-Si), irrigation and drainage, environmental control (temp/RH/ventilation). (Anonymous: Preprint submitted to Elsevier.)
-
[21]
Address resistance management: rotate modes of action (e.g., FRAC for fungicides, IRAC for insecticides)
-
[22]
## Rules (abridged)
Ensure scientific accuracy, safety, and applicability. ## Rules (abridged)
-
[23]
Base answers on provided background; supplement only with widely accepted general agronomic knowledge if needed
-
[24]
If specific data (dose, interval, temperature threshold) are provided in the background, use them verbatim
-
[25]
Include active ingredient and formulation when available (e.g., 20% tebuconazole EC); provide dilution ratios, spray volume, timing, interval, and number of sprays
-
[26]
For cultural practices, include actionable details: balanced N-P-K, silicon/potash, drainage, plant spacing, pruning for airflow, temperature/RH targets
-
[27]
Rotate fungicides with different FRAC codes; avoid more than 2 consecutive applications of the same MoA
-
[28]
Answer2: Disease explanation (symptoms, causes/etiology, disease cycle/epidemiology, conducive conditions)
Provide TWO answers: Answer1: Treatment, prevention, and control (step-by-step IPM with specific methods, timings, dosages, intervals). Answer2: Disease explanation (symptoms, causes/etiology, disease cycle/epidemiology, conducive conditions)
-
[29]
Five few-shot examples are included
Advise PPE, follow local labels, observe pre-harvest intervals (PHI) and re-entry intervals (REI). Five few-shot examples are included. Table 5 shows one representative case: Table 5: Representative few-shot example for knowledge QA (Stage 2). Input / Output. Caption: Plant: Wheat; Disease: Leaf Rust... Question: What control techniques are applicable to Wheat...
-
[30]
Accuracy of Plant Identification: Correct identification of crop species
-
[31]
Accuracy of Disease/Pest Identification: Correct identification of disease or pest
-
[32]
Symptom Description Accuracy: Precise description of disease symptoms
-
[33]
Adherence to Required Format: Proper structure with plant and disease identification
-
[34]
choice": 1 or 2,
Completeness and Professionalism: Comprehensive and scientifically sound response ## Task: Compare Answer 1 and Answer 2 based on the above criteria and select the better one. ## Output Format: { "choice": 1 or 2, "reason": "Brief explanation for your choice", "scores": { "answer1": { "plant_accuracy": 0-1, "disease_accuracy": 0-1, "symptom_accuracy": 0-1...
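The judge output format quoted in the last entry above is truncated, so the sketch below validates only the fields that are visible there: `choice`, `reason`, and per-answer `scores` on a 0-1 scale. The `answer2` key, the `parse_verdict` name, and the consistency check against mean scores are assumptions added for illustration; they are not taken from the paper or its repository.

```python
import json
from statistics import mean
from typing import Any, Dict, Optional


def parse_verdict(raw: str) -> Dict[str, Any]:
    """Validate a judge reply and summarise its per-answer scores."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")

    choice = data.get("choice")
    # bool is a subclass of int in Python, so reject True/False explicitly.
    if isinstance(choice, bool) or choice not in (1, 2):
        raise ValueError(f"choice must be 1 or 2, got {choice!r}")
    reason = data.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")

    mean_scores: Dict[str, float] = {}
    for answer, criteria in data.get("scores", {}).items():
        if not isinstance(criteria, dict) or not criteria:
            raise ValueError(f"scores for {answer} must be a non-empty object")
        for name, value in criteria.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not 0 <= value <= 1:
                raise ValueError(f"{answer}.{name} must be a number in [0, 1]")
        mean_scores[answer] = mean(criteria.values())

    # Assumed check: the chosen answer should not score lower on average than the other.
    consistent: Optional[bool] = None
    if "answer1" in mean_scores and "answer2" in mean_scores:
        chosen, other = ("answer1", "answer2") if choice == 1 else ("answer2", "answer1")
        consistent = mean_scores[chosen] >= mean_scores[other]

    return {"choice": choice, "reason": reason,
            "mean_scores": mean_scores, "consistent": consistent}
```

Flagging verdicts where the choice contradicts the judge's own scores is one cheap way to surface cases worth a human look, which fits the audit-trail framing of the review.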