arxiv: 2511.23253 · v3 · pith:7V7NKNXYnew · submitted 2025-11-28 · 💻 cs.AI

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

Pith reviewed 2026-05-21 17:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords chain-of-thought reasoning · vision-language models · agricultural applications · visual question answering · reasoning benchmark · precision farming · crop monitoring

The pith

AgroCoT dataset tests vision-language models on chain-of-thought reasoning for farm tasks and finds clear performance gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates AgroCoT, a dataset of 4759 visual question-answering items that require step-by-step reasoning to answer questions about crops, pests, monitoring, and sustainability. It runs zero-shot tests on 30 vision-language models to measure how well they perform logical reasoning and problem-solving in agricultural settings. The results show that existing models still fall short on these tasks. This matters because stronger reasoning would directly support practical uses like precision farming and better environmental decisions. The work argues that benchmarks must include chain-of-thought evaluation to expose these shortfalls rather than relying on standard visual question answering alone.

Core claim

AgroCoT supplies a curated collection of 4759 samples that embed chain-of-thought reasoning into visual question answering for agriculture; when 30 representative vision-language models are evaluated on it, they exhibit measurable gaps in reasoning ability, demonstrating that chain-of-thought prompting and assessment are necessary to reveal true capability in complex agricultural contexts.

What carries the argument

The AgroCoT dataset, which pairs agricultural images with questions that demand explicit chain-of-thought steps for logical problem solving.

If this is right

  • Models can be ranked more reliably for agricultural use once chain-of-thought reasoning is measured.
  • Development efforts can focus on closing the identified reasoning gaps in vision-language models.
  • Precision farming and environmental applications become more feasible when models handle step-by-step agricultural reasoning.
  • Future benchmarks in other domains may adopt similar chain-of-thought integration to test reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could serve as a template for creating reasoning tests in other specialized domains like medicine or manufacturing.
  • Training procedures that explicitly reward chain-of-thought outputs might narrow the gaps observed here.
  • Real-world agricultural systems could integrate these models more safely once reasoning performance is verified on this benchmark.

Load-bearing premise

The 4759 curated samples accurately capture the reasoning and problem-solving skills required in real agricultural contexts.

What would settle it

The claim would be undermined if models that score high on AgroCoT showed no advantage over low-scoring models when deployed on actual field tasks, such as pest detection or crop monitoring, that require similar reasoning.
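
One way to make this test concrete is a rank correlation between benchmark scores and field-task scores: a coefficient near zero would suggest that AgroCoT rankings carry no advantage into deployment. The sketch below is a minimal, standard-library-only illustration. The function names `ranks` and `spearman` and all scores are hypothetical, invented for illustration, and not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical, made-up scores for five models (NOT results from the paper).
benchmark_scores = [0.62, 0.48, 0.71, 0.55, 0.39]
field_scores = [0.58, 0.51, 0.66, 0.49, 0.41]
rho = spearman(benchmark_scores, field_scores)
print(f"Spearman rho = {rho:.2f}")
```

With real data, a high coefficient would support the benchmark's relevance to field use, while a value close to zero would be the outcome described above.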

Figures

Figures reproduced from arXiv: 2511.23253 by Defeng Gu, Haohuan Fu, Henglian Huang, Jianxi Huang, Jiarui Zhang, Jing Wu, Juepeng Zheng, Lingyuan Zhao, Qingmei Li, Shuohong Lou, Xiaoya Fan, Yang Zhang, Yibin Wen, Yuhang Chen, Yutong Lu, Zi Ye, Zurong Mai.

Figure 1: Comparison of AgriCoT with previous agricultural mul…
Figure 2: Comparison of VLMs across multiple dimensions.
Figure 3: The number of samples across different dimensions in…
Figure 4: Hierarchical task system of AgriCoT. Based on the progressive cognitive pipeline in agricultural intelligence, AgriCoT constructs…
Figure 5: The construction of AgriCoT benchmark primarily comprises four steps: collecting samples from data sources, ensuring the…
Figure 6: Statistics of AgriCoT, from three perspectives: the distribution of question types across steps and dimensions, the number of…
Figure 7: Performance and analysis of various VLMs across different perspectives, including model size, CoT length and CoT step count.
Figure 8: Percentage of Different Datasets. 7. Data Collection 7.1. Data Sources Selection AgriCoT is constructed based on four state-of-the-art agricultural vision-language benchmarks (see Tab. 3), each offering unique characteristics and specialized annotations for comprehensive evaluation of VLMs in agriculture. CDDM [26] enhances our dataset with large-scale crop disease coverage, comprising 137,000 disease im…
Figure 9: Worldwide Distribution of AgriCoT
Figure 10: An example of CoT pre-generation prompt.
Figure 11: ROUGE-L F1 (a) and BERTScore F1 (b) vary with model parameter size.
Figure 12: Overall performance of several VLMs using ROUGE-1, ROUGE-2, ROUGE-L and BERTScore across different step counts.
Figure 13: A case of the Plant Management task with responses from GPT-4.1 and InternVL3-38B.
Figure 14: A case of the Organism Identification task with responses from GPT-4.1 and InternVL3-38B.
Figure 15: A case of the Anomaly Detection task with responses from GPT-4.1 and InternVL3-38B.
Figure 16: A case of the Organism Counting task with responses from GPT-4.1 and InternVL3-38B.
Figure 17: A case of the Regional Counting task with responses from GPT-4.1 and InternVL3-38B.
Figure 18: A case of the Morphological Statistics task with responses from GPT-4.1 and InternVL3-38B.
Figure 19: A case of the Pest Recognition task with responses from GPT-4.1 and InternVL3-38B.
Figure 20: A case of the Disease Diagnosis task with responses from GPT-4.1 and InternVL3-38B.
Figure 21: A case of the BioMorph Identification task with responses from GPT-4.1 and InternVL3-38B.
Figure 22: A case of the Spatial Relationship task with responses from GPT-4.1 and InternVL3-38B.
Figure 23: A case of the Boundary Analysis task with responses from GPT-4.1 and InternVL3-38B.
Figure 24: A case of the Area Evaluation task with responses from GPT-4.1 and InternVL3-38B.
Figure 25: A case of the Agri-Tools task with responses from GPT-4.1 and InternVL3-38B.
Figure 26: A case of the Agri-Methods task with responses from GPT-4.1 and InternVL3-38B.
Figure 27: A case of the Management Decisions task with responses from GPT-4.1 and InternVL3-38B.

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/AgroCoT/AgroCoT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AgroCoT, a VQA dataset of 4,759 samples that incorporates Chain-of-Thought reasoning to evaluate VLMs on agricultural tasks such as precision farming and pest detection. It reports results from evaluating 30 proprietary and open-source VLMs in zero-shot settings, identifying gaps in reasoning capabilities and arguing for the importance of CoT-based assessment. The dataset is released publicly via Hugging Face.

Significance. A rigorously validated benchmark of this type could help identify specific limitations in current VLMs for multi-step agricultural reasoning and encourage development of models better suited to real-world deployment. The public release of the dataset itself is a constructive contribution that enables further research.

major comments (3)
  1. [Dataset Creation] Dataset curation section: the claim that the 4,759 samples are 'carefully curated' to test critical reasoning and problem-solving lacks any description of the generation process for questions and CoT rationales, the qualifications of annotators, filtering criteria for multi-step necessity, or agricultural fidelity checks.
  2. [Experiments and Evaluation] Evaluation protocol: no details are given on how CoT reasoning is scored in model outputs (e.g., step-wise correctness, partial credit, or automated vs. human judgment), nor on inter-annotator agreement for the ground-truth rationales.
  3. [Introduction and Results] Central claim in abstract and conclusion: attribution of performance gaps to 'reasoning capabilities' rather than dataset artifacts requires evidence that the samples isolate logical reasoning without confounding biases in visual patterns or question phrasing; this is not addressed.
minor comments (2)
  1. [Abstract] Abstract: consider adding one sentence summarizing the main categories of agricultural reasoning tasks represented in the dataset.
  2. [Data Availability] Dataset release: confirm that the Hugging Face repository includes documentation on sample provenance, licensing, and any usage restrictions.
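
The captions of Figures 11 and 12 name ROUGE-L F1 among the automatic metrics. As background for major comment 2, the following is a minimal sketch of sentence-level ROUGE-L F1 built on the longest common subsequence (LCS). It assumes lowercase whitespace tokenization and a single reference; the paper's actual tokenization, aggregation, and implementation are not described on this page, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 (whitespace tokenization is an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Invented example rationale pair, for illustration only.
reference = "the leaf shows yellow spots so the crop is diseased"
candidate = "the leaf shows brown spots so the crop is healthy"
print(round(rouge_l_f1(candidate, reference), 2))
```

Note that the candidate above reaches a high overlap score while stating the opposite conclusion, which illustrates why the referee asks how reasoning correctness, rather than surface overlap, is scored.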

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to incorporate the suggested improvements.

point-by-point responses
  1. Referee: [Dataset Creation] Dataset curation section: the claim that the 4,759 samples are 'carefully curated' to test critical reasoning and problem-solving lacks any description of the generation process for questions and CoT rationales, the qualifications of annotators, filtering criteria for multi-step necessity, or agricultural fidelity checks.

    Authors: We acknowledge the need for greater transparency in the dataset curation process. In the revised version of the manuscript, we will provide a detailed account of how the questions and corresponding CoT rationales were generated, including the involvement of qualified annotators with expertise in agriculture. We will also describe the filtering criteria used to ensure that each sample requires multi-step reasoning and the checks performed to verify agricultural fidelity and relevance to real-world scenarios such as precision farming and pest detection. revision: yes

  2. Referee: [Experiments and Evaluation] Evaluation protocol: no details are given on how CoT reasoning is scored in model outputs (e.g., step-wise correctness, partial credit, or automated vs. human judgment), nor on inter-annotator agreement for the ground-truth rationales.

    Authors: We agree that the evaluation protocol requires more explicit description. We will update the manuscript to detail the scoring process for CoT reasoning, which involves assessing step-wise correctness with partial credit awarded for incomplete but directionally correct reasoning steps. The scoring is primarily conducted through human judgment by domain experts, with automated verification for basic output format compliance. Additionally, we will report the inter-annotator agreement metrics for the ground-truth rationales to demonstrate the reliability of the annotations. revision: yes

  3. Referee: [Introduction and Results] Central claim in abstract and conclusion: attribution of performance gaps to 'reasoning capabilities' rather than dataset artifacts requires evidence that the samples isolate logical reasoning without confounding biases in visual patterns or question phrasing; this is not addressed.

    Authors: This is a valid point. To strengthen the central claim, we will add a new analysis subsection that examines potential confounding factors. Specifically, we will present results showing that performance gaps remain consistent when controlling for visual pattern variations and different question phrasing styles. This evidence will support that the gaps are attributable to reasoning limitations rather than dataset artifacts. We will also discuss the design choices made during curation to promote diversity in visuals and questions. revision: yes
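
The scoring scheme outlined in response 2 (step-wise correctness with partial credit, plus inter-annotator agreement) could be sketched as follows. The credit weights, label names, and function names are assumptions made for illustration; the rebuttal does not specify them, nor which agreement statistic will be reported. Cohen's kappa is used here as one common choice for two raters.

```python
from collections import Counter

# Assumed credit per judged step; the page does not state actual weights.
CREDIT = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}


def stepwise_score(step_labels):
    """Average credit over the judged reasoning steps of one response."""
    if not step_labels:
        return 0.0
    return sum(CREDIT[label] for label in step_labels) / len(step_labels)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented judgments of one four-step rationale by two hypothetical raters.
rater_a = ["correct", "partial", "wrong", "correct"]
rater_b = ["correct", "correct", "wrong", "correct"]
print(stepwise_score(rater_a), stepwise_score(rater_b))
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Averaging per-step credit keeps long and short rationales on the same 0 to 1 scale; a different design could weight later steps more heavily or require a correct final answer.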

Circularity Check

0 steps flagged

No circularity: empirical dataset benchmark with no derivation chain

full rationale

The paper introduces the AgroCoT VQA dataset of 4,759 samples focused on CoT reasoning in agriculture and reports zero-shot performance of 30 public VLMs. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described methodology. Claims rest on empirical results against an externally released dataset and standard models, which are independently verifiable. No self-citations support uniqueness theorems, ansatzes, or load-bearing premises. This is a standard benchmark release paper whose central evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the quality and representativeness of the curated samples plus the assumption that zero-shot model performance on this dataset reveals genuine reasoning gaps rather than dataset artifacts.

axioms (1)
  • domain assumption The 4759 samples were curated to require logical reasoning and effective problem-solving in agricultural contexts.
    This premise is invoked when the abstract states the dataset evaluates critical reasoning skills needed in complex agricultural contexts.

pith-pipeline@v0.9.0 · 5792 in / 1323 out tokens · 58766 ms · 2026-05-21T17:54:38.058820+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

    cs.CV 2026-04 accept novelty 8.0

    HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

Reference graph

Works this paper leans on

186 extracted references · 186 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [3]

    Agrogpt: Efficient agricultural vision-language model with expert tuning

    Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, and Rao Muhammad Anwer. Agrogpt: Efficient agricultural vision-language model with expert tuning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5687–5696. IEEE, 2025.

  3. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1(2):3,

  5. [6]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  6. [7]

    Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data

    Yuhui Bie, Guowei Xu, and Yaojun Wang. Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data. In 2025 13th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–6, 2025.

  7. [8]

    Cucumber disease recognition with small samples using image-text-label-based multi-modal language model. Computers and electronics in agriculture, 211:107993, 2023

    Yiyi Cao, Lei Chen, Yuan Yuan, and Guangling Sun. Cucumber disease recognition with small samples using image-text-label-based multi-modal language model. Computers and electronics in agriculture, 211:107993, 2023.

  8. [9]

    M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought

    Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8199–8221, 2024.

  9. [10]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

  10. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  11. [12]

    Mirage: A benchmark for multimodal information-seeking and reasoning in agricultural expert-guided conversations

    Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, and Vikram S Adve. Mirage: A benchmark for multimodal information-seeking and reasoning in agricultural expert-guided conversations. arXiv preprint arXiv:2506.20100, 2025.

  12. [13]

    Interleaved-modal chain-of-thought

    Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19520–19529, 2025.

  13. [14]

    Benchmarking multimodal cot reward model stepwise by visual program. arXiv preprint arXiv:2504.06606, 2025

    Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, and Yueting Zhuang. Benchmarking multimodal cot reward model stepwise by visual program. arXiv preprint arXiv:2504.06606, 2025.

  14. [15]

    Agmmu: A comprehensive agricultural multimodal understanding benchmark

    Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. Agmmu: A comprehensive agricultural multimodal understanding benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  15. [16]

    Rbench-v: A primary assessment for visual reasoning models with multi-modal outputs

    Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, et al. Rbench-v: A primary assessment for visual reasoning models with multi-modal outputs. arXiv preprint arXiv:2505.16770, 2025.

  16. [17]

    Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark

    Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444,

  17. [18]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  18. [19]

    Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. arXiv preprint arXiv:2502.09621, 2025

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. arXiv preprint arXiv:2502.09621, 2025.

  19. [20]

    Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.

  20. [21]

    Gpt-5 and open-weight large language models: Advances in reasoning, transparency, and control. Information Systems, page 102620, 2025

    Maikel Leon. Gpt-5 and open-weight large language models: Advances in reasoning, transparency, and control. Information Systems, page 102620, 2025.

  21. [22]

    Can large multimodal models understand agricultural scenes? benchmarking with agromind. arXiv preprint arXiv:2505.12207, 2025

    Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind. arXiv preprint arXiv:2505.12207, 2025.

  22. [23]

    Think-bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models. arXiv preprint arXiv:2505.22113, 2025

    Zhiyuan Li, Yi Chang, and Yuan Wu. Think-bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models. arXiv preprint arXiv:2505.22113, 2025.

  23. [24]

    Benchmark Evaluations, Applications, and Challenges of large Vision Language Models: a survey, 1 2025

    Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark evaluations, applications, and challenges of large vision language models: A survey. arXiv preprint arXiv:2501.02189, 1, 2025.

  24. [25]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.

  25. [26]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

  26. [27]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.

  27. [28]

    A multimodal benchmark dataset and model for crop disease diagnosis

    Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, and Shiguo Lian. A multimodal benchmark dataset and model for crop disease diagnosis. In European Conference on Computer Vision, pages 157–170. Springer, 2024.

  28. [29]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525,

  30. [31]

    Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521,

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521,

  31. [32]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.

  32. [33]

    Compositional chain-of-thought prompting for large multimodal models

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024.

  33. [34]

    Prism-bench: A benchmark of puzzle-based visual tasks with cot error detection. arXiv preprint arXiv:2510.23594, 2025

    Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, and Zhe Gan. Prism-bench: A benchmark of puzzle-based visual tasks with cot error detection. arXiv preprint arXiv:2510.23594, 2025.

  34. [35]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  35. [36]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European conference on computer vision, pages 146–162. Springer, 2022.

  36. [37]

    The role of large language models in agriculture: harvesting the future with LLM intelligence

    Tawseef Ayoub Shaikh, Tabasum Rasool, K. Veningston, and Syed Mufassir Yaseen. The role of large language models in agriculture: harvesting the future with LLM intelligence. Progress in Artificial Intelligence, 14(2):117–164, 2025.

  37. [38]

    Agrobench: Vision-language model benchmark in agriculture

    Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025.

  38. [39]

    Assessing gpt4-v on structured reasoning tasks. arXiv preprint arXiv:2312.11524, 2023

    Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, and Gust Verbruggen. Assessing gpt4-v on structured reasoning tasks. arXiv preprint arXiv:2312.11524, 2023.

  39. [40]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  40. [41]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  41. [42]

    Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases

    Liqiong Wang, Teng Jin, Jinyu Yang, Ales Leonardis, Fangyi Wang, and Feng Zheng. Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases. arXiv preprint arXiv:2412.02158, 2024.

  42. [43]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

  43. [44]

    Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025.

  44. [45]

    Chain-of-probe: Examining the necessity and accuracy of cot step-by-step

    Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. Chain-of-probe: Examining the necessity and accuracy of cot step-by-step. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2586–2606, 2025.

  45. [46]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

  46. [47]

    Farmsegvlm: A farmland remote sensing image segmentation method considering vision-language alignment

    Haiyang Wu, Weiliang Mu, Dandan Zhong, Zhuofei Du, Haifeng Li, and Chao Tao. Farmsegvlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025.

  47. [48]

    AgriGPT-VL: Agricultural Vision-Language Understanding Suite

    Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, et al. Agrigpt-vl: Agricultural vision-language understanding suite. arXiv preprint arXiv:2510.04002, 2025.

  48. [49]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.

  49. [50]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

  50. [51]

    Empowering and assessing the utility of large language models in crop science

    Hang Zhang, Jiawei Sun, Renqi Chen, Wei Liu, Zhonghang Yuan, Xinzhe Zheng, Zhefan Wang, Zhiyuan Yang, Hang Yan, Hansen Zhong, et al. Empowering and assessing the utility of large language models in crop science. Advances in Neural Information Processing Systems, 37:52670–52722, 2024.

  51. [52]

    Visual large language model for wheat disease diagnosis in the wild

    Kunpeng Zhang, Li Ma, Beibei Cui, Xin Li, Boqiang Zhang, and Na Xie. Visual large language model for wheat disease diagnosis in the wild. Computers and Electronics in Agriculture, 227:109587, 2024.

  52. [53]

    Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought

    Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12745–12752, 2025.

  53. [54]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.

  54. [55]

    Multimodal fine-grained transformer model for pest recognition

    Yinshuo Zhang, Lei Chen, and Yuan Yuan. Multimodal fine-grained transformer model for pest recognition. Electronics, 12(12):2620, 2023.

  55. [56]

    Few-shot agricultural pest recognition based on multimodal masked autoencoder

    Yinshuo Zhang, Lei Chen, and Yuan Yuan. Few-shot agricultural pest recognition based on multimodal masked autoencoder. Crop Protection, 187:106993, 2025.

  56. [57]

    Enhancing chain of thought prompting in large language models via reasoning patterns

    Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, and Jinqiao Wang. Enhancing chain of thought prompting in large language models via reasoning patterns. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 25985–25993, 2025.

  57. [58]

    Implementation of large language models and agricultural knowledge graphs for efficient plant disease detection

    Xinyan Zhao, Baiyan Chen, Mengxue Ji, Xinyue Wang, Yuhan Yan, Jinming Zhang, Shiyingjie Liu, Muyang Ye, and Chunli Lv. Implementation of large language models and agricultural knowledge graphs for efficient plant disease detection. Agriculture, 14(8):1359, 2024.

  58. [59]

    Tinyllava: A framework of small-scale large multimodal models

    Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024.

  59. [60]

    Few-shot image classification of crop diseases based on vision–language models

    Yueyue Zhou, Hongping Yan, Kun Ding, Tingting Cai, and Yan Zhang. Few-shot image classification of crop diseases based on vision–language models. Sensors, 24(18):6109, 2024.

  60. [61]

    Percentage of Different Datasets

    AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture. Supplementary Material. Figure 8. Percentage of Different Datasets. Sources: AgroMind, AgroBench, AGMMU, CDDM; chart labels: 87.8%, 4.4%, 4.0%, 3.9%, 12.2%; total 4,535 QA pairs.

  61. [62]

    Data Sources Selection

    Data Collection. 7.1. Data Sources Selection. AgriCoT is constructed based on four state-of-the-art agricultural vision-language benchmarks (see Tab. 3), each offering unique characteristics and specialized annotations for comprehensive evaluation of VLMs in agriculture. CDDM [26] enhances our dataset with large-scale crop disease coverage, comprising 137...

  62. [63]

    <option> || <reasoning steps>

    CoT Construction Details. To ensure that the CoTs generated by GPT-4o exhibit both deep reasoning and a consistent structure, we design a prompt template (see Fig. 10). In addition to the basic inputs (reference image, question, and options), the template explicitly guides the model to clarify the core problem, produce detailed and coherent visual desc...

  63. [64]

    answer then reasoning

    Experiment Details. 9.1. Evaluation Protocols. We supplement the evaluation with ROUGE-1, ROUGE-2, and ROUGE-L, computing precision, recall, and F1 score for each. ROUGE measures n-gram overlap between generated and reference CoTs, reflecting content coverage and sequence structure. Precision indicates how well the model captures relevant information, whi...
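
The ROUGE-N precision, recall, and F1 described in the evaluation protocol can be illustrated with a minimal sketch. It assumes lowercased whitespace tokenization, and the function names `ngram_counts` and `rouge_n` are hypothetical; the scoring tool and tokenizer actually used for the benchmark are not stated in this excerpt. ROUGE-L (longest common subsequence) is omitted for brevity.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap.

    Assumption: simple lowercased whitespace tokenization, which may
    differ from the tokenizer used in the original evaluation.
    """
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    generated = "the tree has no leaves"
    gold = "the tree in the center has no leaves"
    print(rouge_n(generated, gold, n=1))
    print(rouge_n(generated, gold, n=2))
```

In this toy pair, every unigram of the generated text appears in the reference, so precision is high while recall is lower because the reference contains words the generated text misses.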

  64. [65]

    Keyword extraction capability

    More Experimental Results. 10.1. Keyword extraction capability. The ROUGE-1 scores presented in Tab. 5 provide insights into the reasoning capabilities of the evaluated models across various tasks. ROUGE-1, which measures the overlap of unigrams between the generated CoT and the reference, reflects the model’s ability to generate keywords in reasoning t...

  65. [66]

    Further Discussion. Does the reasoning ability of VLMs improve as the parameters increase? As shown in Fig. 11 (a) and (b), except for InternVL2&3 and Qwen2.5-VL, whose reasoning capabilities improve with an increase in parameters when the number of parameters is small, the reasoning capabilities of other models remain relatively stable once the num- ...

  66. [67]

    As model capacities continue to grow, larger and more diverse training data become essential to fully exploit their reasoning potential

    Limitations and Future Work. While AgriCoT has been carefully curated and manually refined to address agricultural VQA tasks, the current dataset scale remains insufficient given the rapid evolution of modern VLMs. As model capacities continue to grow, larger and more diverse training data become essential to fully exploit their reasoning potential. ...

  67. [68]

    AgriCoT will also be released as open-source to facilitate academic research

    Ethical Considerations. All data in this dataset are sourced from publicly available open-source datasets, and their use adheres to the respective copyright and licensing terms. AgriCoT will also be released as open-source to facilitate academic research. Seven annotators complete training covering agricultural knowledge, annotation protocols, and ethica...

  68. [69]

    The question asks about the type of damage observed on the plant in the image, and provides four options related to different types of plant stress or damage. 2. Observing the image, the tree in the center appears to have no leaves, while the surrounding vegetation is green and healthy, indicating that the issue is specific to this tree and not a general ...

  69. [70]

    The plant in the image appears to have no leaves, which suggests it has experienced some form of damage or stress. 2. The branches are bare, indicating that the damage occurred after the leaves had already grown, which is more consistent with a late-season event. 3. The absence of visible insect damage or signs of bacterial infection (such as leaf spots o...

  70. [71]

    The question asks to identify the type of damage observed on the plant in the image, providing four options: late season freeze-hail-freeze damage, insect infestation damage, bacterial leaf scorch, and early summer drought stress

  71. [72]

    Observing the image, the tree appears to have sparse foliage, with branches that are mostly bare and no visible signs of leaf scorch, insect activity, or drought-related wilting

  72. [73]

    Late season freeze-hail-freeze damage is consistent with the observed symptoms, as freezing temperatures can cause buds and leaves to fail to develop properly, leaving the tree looking bare and stressed

  73. [74]

    The other options (insect infestation, bacterial leaf scorch, and drought stress) typically show distinct signs such as holes in leaves, browning or scorching of leaf edges, or drooping foliage, none of which are visible in the image

  74. [75]

    Plant Management Figure 13

    Based on the analysis, the most appropriate option is late season freeze-hail-freeze damage. Plant Management Figure 13. A case of the Plant Management task with responses from GPT-4.1 and InternVL3-38B. Question: What are the distinctive features of the plant shown in the image? Options: A: pine tree with needle-like leaves and brown cones B: cactus with...

  75. [76]

    The question asks for the distinctive features of the plant shown in the image, and provides four options describing different types of plants

  76. [77]

    The fronds are arching and some are turning brown, while others are still green and vibrant

    Observing the image, the plant has fronds that are divided into leaflets, which is characteristic of ferns. The fronds are arching and some are turning brown, while others are still green and vibrant. 3. Option A describes a pine tree, which has needle-like leaves and cones, not matching the image. Option B describes a cactus, which has thick, spiny stems...

  77. [78]

    ostrich fern with arching habit and tough, wiry black rhizomes

    The image shows a cluster of ferns with large, arching fronds. 2. The fronds have a distinct, feathery appearance with a central stem and leaflets arranged on either side. 3. Some fronds are green, while others have a brownish hue, indicating different stages of growth or decay. 4. The description of "ostrich fern with arching habit and tough, wiry black ...

  78. [79]

    The question asks to identify the distinctive features of the plant shown in the image and match it with the most appropriate option

  79. [80]

    The image depicts a plant with arching fronds, characteristic of ferns, and some parts of the plant show wiry, dark structures resembling rhizomes

  80. [81]

    Option A describes a pine tree with needle-like leaves and brown cones, which does not match the plant in the image as the plant lacks needle-like leaves or cones

Showing first 80 references.