Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Pith reviewed 2026-05-16 21:12 UTC · model grok-4.3
The pith
Vision-language models substantially underperform supervised classifiers in zero-shot agricultural recognition tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11) across 27 agricultural image classification datasets spanning 162 classes. The best VLM achieves approximately 62% average accuracy under multiple-choice prompting, while open-ended prompting yields accuracies typically below 25%. LLM-based semantic judging increases open-ended accuracy and alters model rankings. Task-level analysis shows that plant and weed species classification remains easier than pest and damage identification.
What carries the argument
Zero-shot benchmarking of diverse VLMs against YOLO11 on the AgML agricultural image datasets under multiple-choice and open-ended prompting with optional LLM semantic judging.
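The two prompting regimes the benchmark contrasts can be sketched as follows; the prompt wording and the exact-match scoring here are illustrative assumptions, not the paper's exact protocol.

```python
def multiple_choice_prompt(class_names):
    """Constrained prompt that lists every candidate label (illustrative wording)."""
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(class_names))
    return ("Identify the subject of this agricultural image.\n"
            f"Choose exactly one option:\n{options}\n"
            "Answer with the option number only.")

def open_ended_prompt():
    """Unconstrained prompt; the model must produce a label from open vocabulary."""
    return ("What plant disease, pest, or species is shown in this image? "
            "Answer with a short label.")

def accuracy(predictions, gold_labels):
    """Fraction of exact matches between predicted and gold labels."""
    return sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)
```

Under multiple-choice scoring, parsing an option number is trivial; open-ended answers require string matching or a semantic judge, which is one reason raw open-ended accuracy is so much lower in the reported results.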
Load-bearing premise
The 27 AgML datasets and the tested prompting strategies are representative enough of real-world agricultural decision support to support conclusions about VLM readiness.
What would settle it
A new zero-shot VLM evaluation on a broad, previously unseen agricultural dataset where the model matches or exceeds YOLO11 accuracy across disease, pest, and species tasks would falsify the central claim.
read the original abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks zero-shot performance of multiple open- and closed-source vision-language models (VLMs) on 27 agricultural image classification datasets from the AgML collection (162 classes, 248k images) spanning plant disease, pest/damage, and species/weed identification. It compares results under multiple-choice prompting, open-ended prompting, and LLM-based semantic judging against a supervised YOLO11 baseline, reporting that the best VLM (Gemini-3 Pro) reaches ~62% average accuracy while supervised performance is markedly higher; open-ended prompting yields <25% raw accuracy, improved to ~30% with judging. Task-level results show species classification easier than pest/damage. The authors conclude current off-the-shelf VLMs are unsuitable as standalone replacements but viable as assistive components with constrained interfaces and domain-aware evaluation.
Significance. If the empirical results hold, the work is significant for agricultural computer vision: it supplies concrete, large-scale accuracy figures across 27 datasets, multiple VLMs, and prompting variants, directly quantifying the gap versus supervised models like YOLO11. The inclusion of both closed-source (e.g., Gemini-3 Pro) and open-source (e.g., Qwen-VL-72B) models, together with the demonstration that evaluation methodology alters rankings, is a clear strength and provides actionable guidance for practitioners on hybrid use cases. The purely empirical design, with direct measurements and no circular derivations, strengthens the contribution, though impact depends on addressing dataset representativeness for real-world generalization.
major comments (2)
- [Dataset description and experimental setup] The central claim that VLMs 'are not yet suitable as standalone agricultural diagnostic systems' rests on the 27 AgML datasets being representative of real-world decision support. The manuscript provides no analysis of dataset characteristics (e.g., image quality, lighting variability, occlusions, or field noise levels) versus typical deployment conditions, nor does it test open-vocabulary or unconstrained queries. This is load-bearing because task-level differences (species easier than pest/damage) already indicate sensitivity to category distribution; without such checks, the 62% vs. markedly higher YOLO11 gap may not generalize.
- [Results on open-ended prompting and LLM judging] In the prompting and evaluation results, LLM-based semantic judging is reported to raise open-ended accuracy (e.g., ~21% to ~30% for top models) and change model rankings, yet no details are given on the judging prompt, validation against human labels, or inter-judge agreement. This methodological choice directly affects the open-ended performance numbers and the overall conclusion about prompting strategies.
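The judging step the referee asks to see documented might, in minimal form, look like the sketch below; the `SYNONYMS` table and the acceptance rule are hypothetical stand-ins for an actual LLM judge prompt, included only to show why judged accuracy can exceed raw exact-match accuracy.

```python
# Hypothetical synonym table standing in for an LLM judge's domain knowledge.
SYNONYMS = {
    "late blight": {"phytophthora infestans", "potato blight"},
    "corn": {"maize", "zea mays"},
}

def exact_match(prediction, gold):
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == gold.strip().lower()

def semantic_match(prediction, gold):
    """Accept exact matches or known synonyms; an LLM judge generalizes this rule."""
    if exact_match(prediction, gold):
        return True
    return prediction.strip().lower() in SYNONYMS.get(gold.strip().lower(), set())

def judged_accuracy(predictions, gold_labels, match=semantic_match):
    """Accuracy under a pluggable matching rule."""
    return sum(match(p, g) for p, g in zip(predictions, gold_labels)) / len(gold_labels)
```

For example, with predictions `["maize", "Late Blight", "wheat rust"]` against gold labels `["corn", "late blight", "leaf spot"]`, exact matching scores 1/3 while the synonym-aware judge scores 2/3 — the same kind of lift (and potential re-ranking) the manuscript reports, which is why the judge's design and validation matter.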
minor comments (2)
- [Abstract] The abstract states 'approximately 62% average accuracy' for Gemini-3 Pro; report the precise value, the exact number of datasets/tasks averaged, and the standard deviation across datasets to improve precision.
- [Experimental details] Ensure all model versions (e.g., Gemini-3 Pro, Qwen-VL-72B) and YOLO11 training hyperparameters are fully specified in the experimental details for reproducibility.
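The precision the first minor comment asks for (exact mean, dataset count, dispersion) is cheap to compute and report; a minimal sketch, using made-up per-dataset accuracies rather than the paper's results:

```python
import statistics

def summarize(per_dataset_accuracy):
    """Macro-average accuracy over datasets, with sample std. deviation and n."""
    return (statistics.mean(per_dataset_accuracy),
            statistics.stdev(per_dataset_accuracy),
            len(per_dataset_accuracy))

# Illustrative numbers only, not the paper's measurements.
mean, sd, n = summarize([0.71, 0.58, 0.64, 0.55, 0.62])
```

Reporting all three values would let readers judge whether the ~62% headline figure is stable across the 27 datasets or dominated by a few easy ones.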
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating revisions where the manuscript will be updated to improve clarity and transparency.
read point-by-point responses
-
Referee: The central claim that VLMs 'are not yet suitable as standalone agricultural diagnostic systems' rests on the 27 AgML datasets being representative of real-world decision support. The manuscript provides no analysis of dataset characteristics (e.g., image quality, lighting variability, occlusions, or field noise levels) versus typical deployment conditions, nor does it test open-vocabulary or unconstrained queries. This is load-bearing because task-level differences (species easier than pest/damage) already indicate sensitivity to category distribution; without such checks, the 62% vs. markedly higher YOLO11 gap may not generalize.
Authors: We agree that dataset representativeness is important for interpreting the generalizability of the zero-shot results. The AgML collection aggregates datasets from diverse agricultural sources that inherently include variations in lighting, occlusions, and image quality typical of field conditions. We did not include a quantitative meta-analysis comparing these characteristics to a broader set of operational deployment scenarios or perform additional open-vocabulary experiments. In revision we will expand the dataset description section with available metadata on imaging conditions and add an explicit limitations paragraph noting the absence of unconstrained query testing and the potential sensitivity to real-world noise distributions beyond the benchmark. We maintain that the reported gap versus YOLO11 holds on these established datasets but will clarify that broader operational validation remains future work. revision: partial
-
Referee: In the prompting and evaluation results, LLM-based semantic judging is reported to raise open-ended accuracy (e.g., ~21% to ~30% for top models) and change model rankings, yet no details are given on the judging prompt, validation against human labels, or inter-judge agreement. This methodological choice directly affects the open-ended performance numbers and the overall conclusion about prompting strategies.
Authors: We acknowledge that the current manuscript lacks sufficient methodological detail on the LLM judging procedure. In the revised version we will add the complete judging prompt, describe its design rationale, and report validation results obtained by comparing LLM judgments on a held-out subset against human expert annotations, including agreement statistics. These additions will allow readers to assess the reliability of the reported accuracy lift and ranking changes under open-ended prompting. revision: yes
- Not addressed: quantitative comparison of AgML image characteristics against a comprehensive survey of typical real-world agricultural deployment conditions, as this would require new data collection outside the scope of the existing benchmark study.
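Validating the LLM judge against human labels, as promised in the rebuttal, reduces to an agreement statistic; Cohen's kappa for two binary raters is one standard choice. This sketch is generic, not the authors' procedure.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two binary raters, e.g., LLM judge vs. human annotator.

    Inputs are equal-length sequences of 0/1 verdicts ("answer is correct").
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pa = sum(rater_a) / n  # rate at which rater A says "correct"
    pb = sum(rater_b) / n  # rate at which rater B says "correct"
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Reporting kappa on a held-out, human-annotated subset (alongside the judging prompt itself) would let readers weigh how much of the ~21% to ~30% accuracy lift is real semantic equivalence versus judge leniency.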
Circularity Check
No significant circularity: pure empirical benchmark with direct accuracy measurements
full rationale
The paper conducts an empirical evaluation of VLMs versus YOLO11 on 27 AgML datasets, reporting measured accuracies under different prompting regimes. No derivations, equations, fitted parameters, or predictions appear in the abstract or described methodology. All claims reduce to observed performance gaps rather than self-referential constructions, self-citations as load-bearing premises, or ansatzes. The representativeness of the datasets is a generalizability question, not a circularity issue. This is a standard non-circular benchmark study.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The AgML collection of 27 datasets adequately represents the range of agricultural visual classification tasks
- domain assumption Zero-shot performance under the tested prompting regimes is the relevant metric for assessing readiness to replace supervised models