Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Pith reviewed 2026-05-16 21:12 UTC · model grok-4.3
The pith
Vision-language models substantially underperform supervised classifiers in zero-shot agricultural recognition tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11) across 27 agricultural image classification datasets spanning 162 classes. The best VLM achieves approximately 62% average accuracy under multiple-choice prompting, while open-ended prompting yields accuracies typically below 25%. LLM-based semantic judging increases open-ended accuracy and alters model rankings. Task-level analysis shows that plant and weed species classification remains easier than pest and damage identification.
What carries the argument
Zero-shot benchmarking of diverse VLMs against YOLO11 on the AgML agricultural image datasets under multiple-choice and open-ended prompting with optional LLM semantic judging.
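The two prompting regimes the benchmark contrasts can be sketched as follows; the prompt wording and the exact-match scoring here are illustrative assumptions, not the paper's exact protocol.

```python
def multiple_choice_prompt(class_names):
    """Constrained prompt that lists every candidate label (illustrative wording)."""
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(class_names))
    return ("Identify the subject of this agricultural image.\n"
            f"Choose exactly one option:\n{options}\n"
            "Answer with the option number only.")

def open_ended_prompt():
    """Unconstrained prompt; the model must produce a label from open vocabulary."""
    return ("What plant disease, pest, or species is shown in this image? "
            "Answer with a short label.")

def accuracy(predictions, gold_labels):
    """Fraction of exact matches between predicted and gold labels."""
    return sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)
```

Under multiple-choice scoring, parsing an option number is trivial; open-ended answers require string matching or a semantic judge, which is one reason raw open-ended accuracy is so much lower in the reported results.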
Load-bearing premise
The 27 AgML datasets and the tested prompting strategies are representative enough of real-world agricultural decision support to support conclusions about VLM readiness.
What would settle it
A new zero-shot VLM evaluation on a broad, previously unseen agricultural dataset where the model matches or exceeds YOLO11 accuracy across disease, pest, and species tasks would falsify the central claim.
read the original abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks zero-shot performance of multiple open- and closed-source vision-language models (VLMs) on 27 agricultural image classification datasets from the AgML collection (162 classes, 248k images) spanning plant disease, pest/damage, and species/weed identification. It compares results under multiple-choice prompting, open-ended prompting, and LLM-based semantic judging against a supervised YOLO11 baseline, reporting that the best VLM (Gemini-3 Pro) reaches ~62% average accuracy while supervised performance is markedly higher; open-ended prompting yields <25% raw accuracy, improved to ~30% with judging. Task-level results show species classification easier than pest/damage. The authors conclude current off-the-shelf VLMs are unsuitable as standalone replacements but viable as assistive components with constrained interfaces and domain-aware evaluation.
Significance. If the empirical results hold, the work is significant for agricultural computer vision: it supplies concrete, large-scale accuracy figures across 27 datasets, multiple VLMs, and prompting variants, directly quantifying the gap versus supervised models like YOLO11. The inclusion of both closed-source (e.g., Gemini-3 Pro) and open-source (e.g., Qwen-VL-72B) models, together with the demonstration that evaluation methodology alters rankings, is a clear strength and provides actionable guidance for practitioners on hybrid use cases. The purely empirical design, with direct measurements and no circular derivations, strengthens the contribution, though impact depends on addressing dataset representativeness for real-world generalization.
major comments (2)
- [Dataset description and experimental setup] The central claim that VLMs 'are not yet suitable as standalone agricultural diagnostic systems' rests on the 27 AgML datasets being representative of real-world decision support. The manuscript provides no analysis of dataset characteristics (e.g., image quality, lighting variability, occlusions, or field noise levels) versus typical deployment conditions, nor does it test open-vocabulary or unconstrained queries. This is load-bearing because task-level differences (species easier than pest/damage) already indicate sensitivity to category distribution; without such checks, the 62% vs. markedly higher YOLO11 gap may not generalize.
- [Results on open-ended prompting and LLM judging] In the prompting and evaluation results, LLM-based semantic judging is reported to raise open-ended accuracy (e.g., ~21% to ~30% for top models) and change model rankings, yet no details are given on the judging prompt, validation against human labels, or inter-judge agreement. This methodological choice directly affects the open-ended performance numbers and the overall conclusion about prompting strategies.
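The judging step the referee asks to see documented might, in minimal form, look like the sketch below; the `SYNONYMS` table and the acceptance rule are hypothetical stand-ins for an actual LLM judge prompt, included only to show why judged accuracy can exceed raw exact-match accuracy.

```python
# Hypothetical synonym table standing in for an LLM judge's domain knowledge.
SYNONYMS = {
    "late blight": {"phytophthora infestans", "potato blight"},
    "corn": {"maize", "zea mays"},
}

def exact_match(prediction, gold):
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == gold.strip().lower()

def semantic_match(prediction, gold):
    """Accept exact matches or known synonyms; an LLM judge generalizes this rule."""
    if exact_match(prediction, gold):
        return True
    return prediction.strip().lower() in SYNONYMS.get(gold.strip().lower(), set())

def judged_accuracy(predictions, gold_labels, match=semantic_match):
    """Accuracy under a pluggable matching rule."""
    return sum(match(p, g) for p, g in zip(predictions, gold_labels)) / len(gold_labels)
```

For example, with predictions `["maize", "Late Blight", "wheat rust"]` against gold labels `["corn", "late blight", "leaf spot"]`, exact matching scores 1/3 while the synonym-aware judge scores 2/3 — the same kind of lift (and potential re-ranking) the manuscript reports, which is why the judge's design and validation matter.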
minor comments (2)
- [Abstract] The abstract states 'approximately 62% average accuracy' for Gemini-3 Pro; report the precise value, the exact number of datasets/tasks averaged, and the standard deviation across datasets to improve precision.
- [Experimental details] Ensure all model versions (e.g., Gemini-3 Pro, Qwen-VL-72B) and YOLO11 training hyperparameters are fully specified in the experimental details for reproducibility.
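The precision the first minor comment asks for (exact mean, dataset count, dispersion) is cheap to compute and report; a minimal sketch, using made-up per-dataset accuracies rather than the paper's results:

```python
import statistics

def summarize(per_dataset_accuracy):
    """Macro-average accuracy over datasets, with sample std. deviation and n."""
    return (statistics.mean(per_dataset_accuracy),
            statistics.stdev(per_dataset_accuracy),
            len(per_dataset_accuracy))

# Illustrative numbers only, not the paper's measurements.
mean, sd, n = summarize([0.71, 0.58, 0.64, 0.55, 0.62])
```

Reporting all three values would let readers judge whether the ~62% headline figure is stable across the 27 datasets or dominated by a few easy ones.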
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating revisions where the manuscript will be updated to improve clarity and transparency.
read point-by-point responses
-
Referee: The central claim that VLMs 'are not yet suitable as standalone agricultural diagnostic systems' rests on the 27 AgML datasets being representative of real-world decision support. The manuscript provides no analysis of dataset characteristics (e.g., image quality, lighting variability, occlusions, or field noise levels) versus typical deployment conditions, nor does it test open-vocabulary or unconstrained queries. This is load-bearing because task-level differences (species easier than pest/damage) already indicate sensitivity to category distribution; without such checks, the 62% vs. markedly higher YOLO11 gap may not generalize.
Authors: We agree that dataset representativeness is important for interpreting the generalizability of the zero-shot results. The AgML collection aggregates datasets from diverse agricultural sources that inherently include variations in lighting, occlusions, and image quality typical of field conditions. We did not include a quantitative meta-analysis comparing these characteristics to a broader set of operational deployment scenarios or perform additional open-vocabulary experiments. In revision we will expand the dataset description section with available metadata on imaging conditions and add an explicit limitations paragraph noting the absence of unconstrained query testing and the potential sensitivity to real-world noise distributions beyond the benchmark. We maintain that the reported gap versus YOLO11 holds on these established datasets but will clarify that broader operational validation remains future work. revision: partial
-
Referee: In the prompting and evaluation results, LLM-based semantic judging is reported to raise open-ended accuracy (e.g., ~21% to ~30% for top models) and change model rankings, yet no details are given on the judging prompt, validation against human labels, or inter-judge agreement. This methodological choice directly affects the open-ended performance numbers and the overall conclusion about prompting strategies.
Authors: We acknowledge that the current manuscript lacks sufficient methodological detail on the LLM judging procedure. In the revised version we will add the complete judging prompt, describe its design rationale, and report validation results obtained by comparing LLM judgments on a held-out subset against human expert annotations, including agreement statistics. These additions will allow readers to assess the reliability of the reported accuracy lift and ranking changes under open-ended prompting. revision: yes
- Not addressed: quantitative comparison of AgML image characteristics against a comprehensive survey of typical real-world agricultural deployment conditions, as this would require new data collection outside the scope of the existing benchmark study.
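Validating the LLM judge against human labels, as promised in the rebuttal, reduces to an agreement statistic; Cohen's kappa for two binary raters is one standard choice. This sketch is generic, not the authors' procedure.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two binary raters, e.g., LLM judge vs. human annotator.

    Inputs are equal-length sequences of 0/1 verdicts ("answer is correct").
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pa = sum(rater_a) / n  # rate at which rater A says "correct"
    pb = sum(rater_b) / n  # rate at which rater B says "correct"
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Reporting kappa on a held-out, human-annotated subset (alongside the judging prompt itself) would let readers weigh how much of the ~21% to ~30% accuracy lift is real semantic equivalence versus judge leniency.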
Circularity Check
No significant circularity: pure empirical benchmark with direct accuracy measurements
full rationale
The paper conducts an empirical evaluation of VLMs versus YOLO11 on 27 AgML datasets, reporting measured accuracies under different prompting regimes. No derivations, equations, fitted parameters, or predictions appear in the abstract or described methodology. All claims reduce to observed performance gaps rather than self-referential constructions, self-citations as load-bearing premises, or ansatzes. The representativeness of the datasets is a generalizability question, not a circularity issue. This is a standard non-circular benchmark study.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The AgML collection of 27 datasets adequately represents the range of agricultural visual classification tasks
- domain assumption Zero-shot performance under the tested prompting regimes is the relevant metric for assessing readiness to replace supervised models