AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

Defeng Gu; Haohuan Fu; Henglian Huang; Jianxi Huang; Jiarui Zhang; Jing Wu; Juepeng Zheng; Lingyuan Zhao; Qingmei Li; Shuohong Lou

arxiv: 2511.23253 · v3 · pith:7V7NKNXYnew · submitted 2025-11-28 · 💻 cs.AI

Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Xiaoya Fan, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, Juepeng Zheng

Pith reviewed 2026-05-21 17:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords chain-of-thought reasoning · vision-language models · agricultural applications · visual question answering · reasoning benchmark · precision farming · crop monitoring

The pith

AgroCoT dataset tests vision-language models on chain-of-thought reasoning for farm tasks and finds clear performance gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates AgroCoT, a dataset of 4759 visual question-answering items that require step-by-step reasoning to answer questions about crops, pests, monitoring, and sustainability. It runs zero-shot tests on 30 vision-language models to measure how well they perform logical reasoning and problem-solving in agricultural settings. The results show that existing models still fall short on these tasks. This matters because stronger reasoning would directly support practical uses like precision farming and better environmental decisions. The work argues that benchmarks must include chain-of-thought evaluation to expose these shortfalls rather than relying on standard visual question answering alone.

Core claim

AgroCoT supplies a curated collection of 4759 samples that embed chain-of-thought reasoning into visual question answering for agriculture; when 30 representative vision-language models are evaluated on it, they exhibit measurable gaps in reasoning ability, demonstrating that chain-of-thought prompting and assessment are necessary to reveal true capability in complex agricultural contexts.

What carries the argument

The AgroCoT dataset, which pairs agricultural images with questions that demand explicit chain-of-thought steps for logical problem solving.

If this is right

Models can be ranked more reliably for agricultural use once chain-of-thought reasoning is measured.
Development efforts can focus on closing the identified reasoning gaps in vision-language models.
Precision farming and environmental applications become more feasible when models handle step-by-step agricultural reasoning.
Future benchmarks in other domains may adopt similar chain-of-thought integration to test reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could serve as a template for creating reasoning tests in other specialized domains like medicine or manufacturing.
Training procedures that explicitly reward chain-of-thought outputs might narrow the gaps observed here.
Real-world agricultural systems could integrate these models more safely once reasoning performance is verified on this benchmark.

Load-bearing premise

The 4759 curated samples accurately capture the reasoning and problem-solving skills required in real agricultural contexts.

What would settle it

The claim would be undermined if models that score high on AgroCoT showed no advantage over low-scoring models when deployed on actual field tasks, such as pest detection or crop monitoring, that require similar reasoning.
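
One way to run this check, sketched under stated assumptions: collect each model's AgroCoT score and a field-task metric, then compute a rank correlation. A value near zero would suggest that benchmark rank does not predict field performance. The function names and the example numbers are hypothetical and do not come from the paper; Spearman's coefficient is one reasonable statistic here, not one the paper prescribes.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant sequence; 0.0 is a convention here.
        return 0.0
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers for four models (not results from the paper).
benchmark_scores = [0.71, 0.64, 0.58, 0.49]
field_scores = [0.80, 0.62, 0.66, 0.51]
rho = spearman(benchmark_scores, field_scores)
```

With only a handful of models the coefficient is noisy, so a real study would also need a significance test or a confidence interval.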

Figures

Figures reproduced from arXiv: 2511.23253 by Defeng Gu, Haohuan Fu, Henglian Huang, Jianxi Huang, Jiarui Zhang, Jing Wu, Juepeng Zheng, Lingyuan Zhao, Qingmei Li, Shuohong Lou, Xiaoya Fan, Yang Zhang, Yibin Wen, Yuhang Chen, Yutong Lu, Zi Ye, Zurong Mai.

**Figure 2.** Comparison of VLMs across multiple dimensions.

**Figure 3.** The number of samples across different dimensions in…

**Figure 4.** Hierarchical task system of AgriCoT. Based on the progressive cognitive pipeline in agricultural intelligence, AgriCoT constructs…

**Figure 5.** The construction of AgriCoT benchmark primarily comprises four steps: collecting samples from data sources, ensuring the…

**Figure 6.** Statistics of AgriCoT, from three perspectives: the distribution of question types across steps and dimensions, the number of…

**Figure 7.** Performance and analysis of various VLMs across different perspectives, including model size, CoT length and CoT step count.

**Figure 8.** Percentage of Different Datasets.

7. Data Collection, 7.1. Data Sources Selection: AgriCoT is constructed based on four state-of-the-art agricultural vision-language benchmarks (see Tab. 3), each offering unique characteristics and specialized annotations for comprehensive evaluation of VLMs in agriculture. CDDM [26] enhances our dataset with large-scale crop disease coverage, comprising 137,000 disease im…

**Figure 9.** Worldwide Distribution of AgriCoT

**Figure 10.** An example of CoT pre-generation prompt.

**Figure 11.** ROUGE-L F1 (a) and BERTScore F1 (b) vary with model parameter size.

**Figure 12.** Overall performance of several VLMs using ROUGE-1, ROUGE-2, ROUGE-L and BERTScore across different step counts.

**Figure 13.** A case of the Plant Management task with responses from GPT-4.1 and InternVL3-38B.

**Figure 14.** A case of the Organism Identification task with responses from GPT-4.1 and InternVL3-38B.

**Figure 15.** A case of the Anomaly Detection task with responses from GPT-4.1 and InternVL3-38B.

**Figure 16.** A case of the Organism Counting task with responses from GPT-4.1 and InternVL3-38B.

**Figure 17.** A case of the Regional Counting task with responses from GPT-4.1 and InternVL3-38B.

**Figure 18.** A case of the Morphological Statistics task with responses from GPT-4.1 and InternVL3-38B.

**Figure 19.** A case of the Pest Recognition task with responses from GPT-4.1 and InternVL3-38B.

**Figure 20.** A case of the Disease Diagnosis task with responses from GPT-4.1 and InternVL3-38B.

**Figure 21.** A case of the BioMorph Identification task with responses from GPT-4.1 and InternVL3-38B.

**Figure 22.** A case of the Spatial Relationship task with responses from GPT-4.1 and InternVL3-38B.

**Figure 23.** A case of the Boundary Analysis task with responses from GPT-4.1 and InternVL3-38B.

**Figure 24.** A case of the Area Evaluation task with responses from GPT-4.1 and InternVL3-38B.

**Figure 25.** A case of the Agri-Tools task with responses from GPT-4.1 and InternVL3-38B.

**Figure 26.** A case of the Agri-Methods task with responses from GPT-4.1 and InternVL3-38B.

**Figure 27.** A case of the Management Decisions task with responses from GPT-4.1 and InternVL3-38B.
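
Figures 11 and 12 report ROUGE-L F1 among their metrics. As a reminder of what that number measures, the following minimal sketch computes ROUGE-L F1 from the longest common subsequence (LCS) of two token sequences. It assumes lowercased whitespace tokenization with no stemming, which may differ from the tooling the authors used; the function names and example strings are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


# Illustrative strings, not samples from the dataset.
score = rouge_l_f1(
    "the leaf shows brown rust spots",
    "the leaf shows rust colored spots",
)
```

Because the LCS rewards shared word order rather than correct logic, a fluent but wrong rationale can still score well, which is one reason the referee asks how reasoning steps are actually judged.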

Original abstract

Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/AgroCoT/AgroCoT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgroCoT adds a targeted agriculture VQA dataset with CoT annotations and tests 30 models, but missing curation details limit how much we can trust the reasoning claims.

The main point is that this paper releases AgroCoT, a set of 4,759 VQA samples in agriculture that include chain-of-thought reasoning steps, then runs it on 30 VLMs to show performance gaps in zero-shot logical problem-solving for tasks like crop monitoring or pest detection. They make the dataset public on Hugging Face, which is a straightforward positive step for anyone working on domain-specific multimodal models. The evaluation covers both proprietary and open-source systems and gives a clear picture of where current models fall short on multi-step reasoning in this setting. That part is useful as a practical benchmark for applied work in precision agriculture.

The soft spot is the curation process. The abstract describes the samples as carefully curated for critical reasoning, yet provides no information on who generated the questions and CoT rationales, whether domain experts reviewed them, or any inter-annotator agreement numbers. Without those details it is difficult to rule out that the reported gaps reflect dataset artifacts rather than genuine differences in reasoning ability. The stress-test note on unverified curation assumptions matches what is visible here.

This work is aimed at researchers building or testing VLMs for agricultural applications and at people who create specialized benchmarks. A reader focused on practical evaluation tools would find the model comparisons worth looking at. It deserves peer review because the dataset release and broad evaluation give it enough substance for referees to request the missing methodological transparency and strengthen the paper.

Referee Report

3 major / 2 minor

Summary. The paper introduces AgroCoT, a VQA dataset of 4,759 samples that incorporates Chain-of-Thought reasoning to evaluate VLMs on agricultural tasks such as precision farming and pest detection. It reports results from evaluating 30 proprietary and open-source VLMs in zero-shot settings, identifying gaps in reasoning capabilities and arguing for the importance of CoT-based assessment. The dataset is released publicly via Hugging Face.

Significance. A rigorously validated benchmark of this type could help identify specific limitations in current VLMs for multi-step agricultural reasoning and encourage development of models better suited to real-world deployment. The public release of the dataset itself is a constructive contribution that enables further research.

major comments (3)

[Dataset Creation] Dataset curation section: the claim that the 4,759 samples are 'carefully curated' to test critical reasoning and problem-solving lacks any description of the generation process for questions and CoT rationales, the qualifications of annotators, filtering criteria for multi-step necessity, or agricultural fidelity checks.
[Experiments and Evaluation] Evaluation protocol: no details are given on how CoT reasoning is scored in model outputs (e.g., step-wise correctness, partial credit, or automated vs. human judgment), nor on inter-annotator agreement for the ground-truth rationales.
[Introduction and Results] Central claim in abstract and conclusion: attribution of performance gaps to 'reasoning capabilities' rather than dataset artifacts requires evidence that the samples isolate logical reasoning without confounding biases in visual patterns or question phrasing; this is not addressed.
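
The inter-annotator agreement requested in the second comment is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration for two annotators who each assign one categorical label per item. The function name and labels are hypothetical, and the paper does not state which agreement statistic, if any, was used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere; treated as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on whether four rationales are valid.
annotator_1 = ["valid", "valid", "valid", "invalid"]
annotator_2 = ["valid", "valid", "invalid", "invalid"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance.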

minor comments (2)

[Abstract] Abstract: consider adding one sentence summarizing the main categories of agricultural reasoning tasks represented in the dataset.
[Data Availability] Dataset release: confirm that the Hugging Face repository includes documentation on sample provenance, licensing, and any usage restrictions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to incorporate the suggested improvements.

Referee: [Dataset Creation] Dataset curation section: the claim that the 4,759 samples are 'carefully curated' to test critical reasoning and problem-solving lacks any description of the generation process for questions and CoT rationales, the qualifications of annotators, filtering criteria for multi-step necessity, or agricultural fidelity checks.

Authors: We acknowledge the need for greater transparency in the dataset curation process. In the revised version of the manuscript, we will provide a detailed account of how the questions and corresponding CoT rationales were generated, including the involvement of qualified annotators with expertise in agriculture. We will also describe the filtering criteria used to ensure that each sample requires multi-step reasoning and the checks performed to verify agricultural fidelity and relevance to real-world scenarios such as precision farming and pest detection. revision: yes
Referee: [Experiments and Evaluation] Evaluation protocol: no details are given on how CoT reasoning is scored in model outputs (e.g., step-wise correctness, partial credit, or automated vs. human judgment), nor on inter-annotator agreement for the ground-truth rationales.

Authors: We agree that the evaluation protocol requires more explicit description. We will update the manuscript to detail the scoring process for CoT reasoning, which involves assessing step-wise correctness with partial credit awarded for incomplete but directionally correct reasoning steps. The scoring is primarily conducted through human judgment by domain experts, with automated verification for basic output format compliance. Additionally, we will report the inter-annotator agreement metrics for the ground-truth rationales to demonstrate the reliability of the annotations. revision: yes
Referee: [Introduction and Results] Central claim in abstract and conclusion: attribution of performance gaps to 'reasoning capabilities' rather than dataset artifacts requires evidence that the samples isolate logical reasoning without confounding biases in visual patterns or question phrasing; this is not addressed.

Authors: This is a valid point. To strengthen the central claim, we will add a new analysis subsection that examines potential confounding factors. Specifically, we will present results showing that performance gaps remain consistent when controlling for visual pattern variations and different question phrasing styles. This evidence will support that the gaps are attributable to reasoning limitations rather than dataset artifacts. We will also discuss the design choices made during curation to promote diversity in visuals and questions. revision: yes
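
The second response describes step-wise scoring with partial credit but gives no formula. The sketch below shows one possible reading, purely as an assumption: a human judge labels each reasoning step as correct, partial, or incorrect, and the sample score is the mean credit over steps. The label names, the 0.5 weight for partial credit, and the function name are all hypothetical and are not taken from the paper or the rebuttal.

```python
# Hypothetical credit table; the 0.5 weight for "partial" is an assumption.
CREDIT = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


def stepwise_score(judgments):
    """Mean credit over per-step judgments for one model response."""
    if not judgments:
        return 0.0
    unknown = [j for j in judgments if j not in CREDIT]
    if unknown:
        raise ValueError(f"unknown judgment labels: {unknown}")
    return sum(CREDIT[j] for j in judgments) / len(judgments)


# Example: a four-step rationale with one partially correct step and one wrong step.
score = stepwise_score(["correct", "partial", "incorrect", "correct"])
```

A mean over steps treats every step as equally important; a scheme that weights the final answer or penalizes early errors more heavily would rank models differently, which is why the protocol needs to be stated explicitly.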

Circularity Check

0 steps flagged

No circularity: empirical dataset benchmark with no derivation chain

The paper introduces the AgroCoT VQA dataset of 4,759 samples focused on CoT reasoning in agriculture and reports zero-shot performance of 30 public VLMs. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described methodology. Claims rest on empirical results against an externally released dataset and standard models, which are independently verifiable. No self-citations support uniqueness theorems, ansatzes, or load-bearing premises. This is a standard benchmark release paper whose central evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the quality and representativeness of the curated samples plus the assumption that zero-shot model performance on this dataset reveals genuine reasoning gaps rather than dataset artifacts.

axioms (1)

domain assumption: The 4759 samples were curated to require logical reasoning and effective problem-solving in agricultural contexts.
This premise is invoked when the abstract states the dataset evaluates critical reasoning skills needed in complex agricultural contexts.

pith-pipeline@v0.9.0 · 5792 in / 1323 out tokens · 58766 ms · 2026-05-21T17:54:38.058820+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We propose AgriCoT, a comprehensive agricultural multi-step, multimodal CoT dataset comprising 4,535 VQA pairs... five key dimensions (object detection, quantitative analysis, disease monitoring, spatial understanding, and environmental management)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "Our evaluation of 26 representative VLMs... reveals a gap in their reasoning capabilities"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV 2026-04 accept novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

Reference graph

Works this paper leans on

186 extracted references · 186 canonical work pages · cited by 1 Pith paper · 14 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[3] Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, and Rao Muhammad Anwer. Agrogpt: Efficient agricultural vision-language model with expert tuning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5687–5696. IEEE, 2025.
[4] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1(2):3.
[6] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[7] Yuhui Bie, Guowei Xu, and Yaojun Wang. Adapting vision-language models for precision agriculture: A study on crop segmentation based on UAV remote sensing data. In 2025 13th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–6, 2025.
[8] Yiyi Cao, Lei Chen, Yuan Yuan, and Guangling Sun. Cucumber disease recognition with small samples using image-text-label-based multi-modal language model. Computers and electronics in agriculture, 211:107993, 2023.
[9] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8199–8221, 2024.
[10] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
[11] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[12] Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, and Vikram S Adve. Mirage: A benchmark for multimodal information-seeking and reasoning in agricultural expert-guided conversations. arXiv preprint arXiv:2506.20100, 2025.
[13] Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19520–19529, 2025.
[14] Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, and Yueting Zhuang. Benchmarking multimodal cot reward model stepwise by visual program. arXiv preprint arXiv:2504.06606, 2025.
[15] Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. Agmmu: A comprehensive agricultural multimodal understanding benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
[16] Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, et al. Rbench-v: A primary assessment for visual reasoning models with multi-modal outputs. arXiv preprint arXiv:2505.16770, 2025.
[17] Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444.
[18] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[19] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. arXiv preprint arXiv:2502.09621, 2025.
[20] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
[21] Maikel Leon. Gpt-5 and open-weight large language models: Advances in reasoning, transparency, and control. Information Systems, page 102620, 2025.
[22] Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind. arXiv preprint arXiv:2505.12207, 2025.
[23] Zhiyuan Li, Yi Chang, and Yuan Wu. Think-bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models. arXiv preprint arXiv:2505.22113, 2025.
[24] Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark evaluations, applications, and challenges of large vision language models: A survey. arXiv preprint arXiv:2501.02189, 1, 2025.
[25] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
[26] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.
[28] Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, and Shiguo Lian. A multimodal benchmark dataset and model for crop disease diagnosis. In European Conference on Computer Vision, pages 157–170. Springer, 2024.
[29] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.
[31] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
[32] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[33] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024.
[34] Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, and Zhe Gan. Prism-bench: A benchmark of puzzle-based visual tasks with cot error detection. arXiv preprint arXiv:2510.23594, 2025.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European conference on computer vision, pages 146–162. Springer, 2022.
[37] Tawseef Ayoub Shaikh, Tabasum Rasool, K. Veningston, and Syed Mufassir Yaseen. The role of large language models in agriculture: harvesting the future with LLM intelligence. Progress in Artificial Intelligence, 14(2):117–164, 2025.
[38] Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025.
[39] Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, and Gust Verbruggen. Assessing gpt4-v on structured reasoning tasks. arXiv preprint arXiv:2312.11524, 2023.
[40] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[41] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[42] Liqiong Wang, Teng Jin, Jinyu Yang, Ales Leonardis, Fangyi Wang, and Feng Zheng. Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases. arXiv preprint arXiv:2412.02158, 2024.
[43] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[44] Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025.
[45] Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. Chain-of-probe: Examining the necessity and accuracy of cot step-by-step. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2586–2606, 2025.
[46] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[47] Haiyang Wu, Weiliang Mu, Dandan Zhong, Zhuofei Du, Haifeng Li, and Chao Tao. Farmsegvlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025.
[48] Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, et al. Agrigpt-vl: Agricultural vision-language understanding suite. arXiv preprint arXiv:2510.04002, 2025.
[49] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.
[50] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[51] Hang Zhang, Jiawei Sun, Renqi Chen, Wei Liu, Zhonghang Yuan, Xinzhe Zheng, Zhefan Wang, Zhiyuan Yang, Hang Yan, Hansen Zhong, et al. Empowering and assessing the utility of large language models in crop science. Advances in Neural Information Processing Systems, 37:52670–52722, 2024.
[52] Kunpeng Zhang, Li Ma, Beibei Cui, Xin Li, Boqiang Zhang, and Na Xie. Visual large language model for wheat disease diagnosis in the wild. Computers and Electronics in Agriculture, 227:109587, 2024.
[53] Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12745–12752, 2025.
[54] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
[55] Yinshuo Zhang, Lei Chen, and Yuan Yuan. Multimodal fine-grained transformer model for pest recognition. Electronics, 12(12):2620, 2023.
[56] Yinshuo Zhang, Lei Chen, and Yuan Yuan. Few-shot agricultural pest recognition based on multimodal masked autoencoder. Crop Protection, 187:106993, 2025.
[57] Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, and Jinqiao Wang. Enhancing chain of thought prompting in large language models via reasoning patterns. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 25985–25993, 2025.
[58] Xinyan Zhao, Baiyan Chen, Mengxue Ji, Xinyue Wang, Yuhan Yan, Jinming Zhang, Shiyingjie Liu, Muyang Ye, and Chunli Lv. Implementation of large language models and agricultural knowledge graphs for efficient plant disease detection. Agriculture, 14(8):1359, 2024.
[59] Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024.
[60] Yueyue Zhou, Hongping Yan, Kun Ding, Tingting Cai, and Yan Zhang. Few-shot image classification of crop diseases based on vision–language models. Sensors, 24(18):6109, 2024.
AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Supplementary Material
Figure 8. Percentage of Different Datasets. Sources shown: AgroMind, AgroBench, AGMMU, and CDDM; total 4,535 QA pairs; chart values 87.8%, 4.4%, 4.0%, 3.9%, and 12.2%.
7. Data Collection
7.1. Data Sources Selection
AgriCoT is constructed based on four state-of-the-art agricultural vision-language benchmarks (see Tab. 3), each offering unique characteristics and specialized annotations for comprehensive evaluation of VLMs in agriculture. CDDM [26] enhances our dataset with large-scale crop disease coverage, comprising 137...
8. CoT Construction Details
To ensure that the CoTs generated by GPT-4o exhibit both deep reasoning and a consistent structure, we design a prompt template (see Fig. 10). In addition to the basic inputs (reference image, question, and options), the template explicitly guides the model to clarify the core problem, produce detailed and coherent visual desc...
<option> || <reasoning steps>
9. Experiment Details
9.1. Evaluation Protocols
We supplement the evaluation with ROUGE-1, ROUGE-2, and ROUGE-L, computing precision, recall, and F1 score for each. ROUGE measures n-gram overlap between generated and reference CoTs, reflecting content coverage and sequence structure. Precision indicates how well the model captures relevant information, whi...
answer then reasoning
10. More Experimental Results
10.1. Keyword extraction capability
The ROUGE-1 scores presented in Tab. 5 provide insights into the reasoning capabilities of the evaluated models across various tasks. ROUGE-1, which measures the overlap of unigrams between the generated CoT and the reference, reflects the model’s ability to generate keywords in reasoning t...
11. Further Discussion
Does the reasoning ability of VLMs improve as the parameters increase? As shown in Fig. 11 (a) and (b), except for InternVL2&3 and Qwen2.5-VL, whose reasoning capabilities improve with an increase in parameters when the number of parameters is small, the reasoning capabilities of other models remain relatively stable once the num- ...
12. Limitations and Future Work
While AgriCoT has been carefully curated and manually refined to address agricultural VQA tasks, the current dataset scale remains insufficient given the rapid evolution of modern VLMs. As model capacities continue to grow, larger and more diverse training data become essential to fully exploit their reasoning potential. ...
13. Ethical Considerations
All data in this dataset are sourced from publicly available open-source datasets, and their use adheres to the respective copyright and licensing terms. AgriCoT will also be released as open-source to facilitate academic research. Seven annotators complete training covering agricultural knowledge, annotation protocols, and ethica...
1. The question asks about the type of damage observed on the plant in the image, and provides four options related to different types of plant stress or damage.
2. Observing the image, the tree in the center appears to have no leaves, while the surrounding vegetation is green and healthy, indicating that the issue is specific to this tree and not a general ...

1. The plant in the image appears to have no leaves, which suggests it has experienced some form of damage or stress.
2. The branches are bare, indicating that the damage occurred after the leaves had already grown, which is more consistent with a late-season event.
3. The absence of visible insect damage or signs of bacterial infection (such as leaf spots o...

1. The question asks to identify the type of damage observed on the plant in the image, providing four options: late season freeze-hail-freeze damage, insect infestation damage, bacterial leaf scorch, and early summer drought stress.
2. Observing the image, the tree appears to have sparse foliage, with branches that are mostly bare and no visible signs of leaf scorch, insect activity, or drought-related wilting.
3. Late season freeze-hail-freeze damage is consistent with the observed symptoms, as freezing temperatures can cause buds and leaves to fail to develop properly, leaving the tree looking bare and stressed.
4. The other options (insect infestation, bacterial leaf scorch, and drought stress) typically show distinct signs such as holes in leaves, browning or scorching of leaf edges, or drooping foliage, none of which are visible in the image.
5. Based on the analysis, the most appropriate option is late season freeze-hail-freeze damage.

Figure 13. A case of the Plant Management task with responses from GPT-4.1 and InternVL3-38B.
Question: What are the distinctive features of the plant shown in the image?
Options: A: pine tree with needle-like leaves and brown cones B: cactus with...
1. The question asks for the distinctive features of the plant shown in the image, and provides four options describing different types of plants.
2. Observing the image, the plant has fronds that are divided into leaflets, which is characteristic of ferns. The fronds are arching and some are turning brown, while others are still green and vibrant.
3. Option A describes a pine tree, which has needle-like leaves and cones, not matching the image. Option B describes a cactus, which has thick, spiny stems...

1. The image shows a cluster of ferns with large, arching fronds.
2. The fronds have a distinct, feathery appearance with a central stem and leaflets arranged on either side.
3. Some fronds are green, while others have a brownish hue, indicating different stages of growth or decay.
4. The description of "ostrich fern with arching habit and tough, wiry black ...

1. The question asks to identify the distinctive features of the plant shown in the image and match it with the most appropriate option.
2. The image depicts a plant with arching fronds, characteristic of ferns, and some parts of the plant show wiry, dark structures resembling rhizomes.
3. Option A describes a pine tree with needle-like leaves and brown cones, which does not match the plant in the image as the plant lacks needle-like leaves or cones.

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, and Rao Muhammad Anwer. Agrogpt: Efficient agricultural vision-language model with expert tuning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5687–5696. IEEE, 2025.
[4] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1(2):3, 2023.
[6] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[7] Yuhui Bie, Guowei Xu, and Yaojun Wang. Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data. In 2025 13th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–6, 2025.
[8] Yiyi Cao, Lei Chen, Yuan Yuan, and Guangling Sun. Cucumber disease recognition with small samples using image-text-label-based multi-modal language model. Computers and electronics in agriculture, 211:107993, 2023.
[9] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8199–8221, 2024.
[10] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
[11] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[12] Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, and Vikram S Adve. Mirage: A benchmark for multimodal information-seeking and reasoning in agricultural expert-guided conversations. arXiv preprint arXiv:2506.20100, 2025.
[13] Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19520–19529, 2025.
[14] Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, and Yueting Zhuang. Benchmarking multimodal cot reward model stepwise by visual program. arXiv preprint arXiv:2504.06606, 2025.
[15] Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. Agmmu: A comprehensive agricultural multimodal understanding benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
[16] Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, et al. Rbench-v: A primary assessment for visual reasoning models with multi-modal outputs. arXiv preprint arXiv:2505.16770, 2025.
[17] Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444, 2025.
[18] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[19] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. arXiv preprint arXiv:2502.09621, 2025.
[20] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
[21] Maikel Leon. Gpt-5 and open-weight large language models: Advances in reasoning, transparency, and control. Information Systems, page 102620, 2025.
[22] Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind. arXiv preprint arXiv:2505.12207, 2025.
[23] Zhiyuan Li, Yi Chang, and Yuan Wu. Think-bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models. arXiv preprint arXiv:2505.22113, 2025.
[24] Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark evaluations, applications, and challenges of large vision language models: A survey. arXiv preprint arXiv:2501.02189, 1, 2025.
[25] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
[26] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.
[27] [28]

A multimodal benchmark dataset and model for crop disease diagnosis

Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Ko- hou Wang, Kai Wang, and Shiguo Lian. A multimodal benchmark dataset and model for crop disease diagnosis. In European Conference on Computer Vision, pages 157–170. Springer, 2024. 2, 3, 5, 1

work page 2024

[28] [29]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,

work page internal anchor Pith review Pith/arXiv arXiv

[29] [30]

6, 7, 3, 4, 5, 8, 9, 10

work page

[30] [31]

Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,

work page

[31] [32]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [33]

Compositional chain-of-thought prompting for large multimodal models

Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024. 3

work page 2024

[33] [34]

Prism-bench: A benchmark of puzzle- based visual tasks with cot error detection.arXiv preprint arXiv:2510.23594, 2025

Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, and Zhe Gan. Prism-bench: A benchmark of puzzle- based visual tasks with cot error detection.arXiv preprint arXiv:2510.23594, 2025. 3

work page arXiv 2025

[34] [35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2

work page 2021

[35] [36]

A-okvqa: A benchmark for visual question answering using world knowl- edge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge. InEuropean conference on computer vision, pages 146–162. Springer, 2022. 3

work page 2022

[36] [37]

Veningston, and Syed Mufassir Yaseen

Tawseef Ayoub Shaikh, Tabasum Rasool, K. Veningston, and Syed Mufassir Yaseen. The role of large language models in agriculture: harvesting the future with LLM intelligence. Progress in Artificial Intelligence, 14(2):117–164, 2025. 1

work page 2025

[37] [38]

Agrobench: Vision-language model benchmark in agriculture

Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025. 2, 3, 5, 1

work page 2025

[38] [39]

Assessing gpt4-v on structured reasoning tasks.arXiv preprint arXiv:2312.11524, 2023

Mukul Singh, Jos ´e Cambronero, Sumit Gulwani, Vu Le, and Gust Verbruggen. Assessing gpt4-v on structured reasoning tasks.arXiv preprint arXiv:2312.11524, 2023. 3

work page arXiv 2023

[39] [40]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 2, 6, 7, 3, 5, 8, 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [41]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 7, 3, 5, 6, 8, 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [42]

Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases

Liqiong Wang, Teng Jin, Jinyu Yang, Ales Leonardis, Fangyi Wang, and Feng Zheng. Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases. arXiv preprint arXiv:2412.02158, 2024. 3

work page arXiv 2024

[42] [43]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reason- ing in language models.arXiv preprint arXiv:2203.11171,

work page internal anchor Pith review Pith/arXiv arXiv

[43] [44]

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [45]

Chain-of-probe: Examining the ne- cessity and accuracy of cot step-by-step

Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. Chain-of-probe: Examining the ne- cessity and accuracy of cot step-by-step. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2586–2606, 2025. 3

work page 2025

[45] [46]

Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 2, 3

work page 2022

[46] [47]

Farmsegvlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025

Haiyang Wu, Weiliang Mu, Dandan Zhong, Zhuofei Du, Haifeng Li, and Chao Tao. Farmsegvlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025. 2, 3

work page 2025

[47] [48]

AgriGPT-VL: Agricultural Vision-Language Understanding Suite, 2025

Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, et al. Agrigpt-vl: Agricultural vision-language understanding suite. arXiv preprint arXiv:2510.04002, 2025. 3

work page arXiv 2025

[48] [49]

Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023. 3

work page 2023

[49] [50]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 3

work page 2024

[50] [51]

Empowering and assessing the utility of large language models in crop science. Advances in Neural Information Processing Systems, 37:52670–52722,

Hang Zhang, Jiawei Sun, Renqi Chen, Wei Liu, Zhonghang Yuan, Xinzhe Zheng, Zhefan Wang, Zhiyuan Yang, Hang Yan, Hansen Zhong, et al. Empowering and assessing the utility of large language models in crop science. Advances in Neural Information Processing Systems, 37:52670–52722,

work page

[51] [52]

Visual large language model for wheat disease diagnosis in the wild. Computers and Electronics in Agriculture, 227:109587, 2024

Kunpeng Zhang, Li Ma, Beibei Cui, Xin Li, Boqiang Zhang, and Na Xie. Visual large language model for wheat disease diagnosis in the wild. Computers and Electronics in Agriculture, 227:109587, 2024. 1, 3

work page 2024

[52] [53]

Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought

Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12745–12752, 2025. 3

work page 2025

[53] [54]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019. 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 1904

[54] [55]

Multimodal fine-grained transformer model for pest recognition. Electronics, 12(12):2620, 2023

Yinshuo Zhang, Lei Chen, and Yuan Yuan. Multimodal fine-grained transformer model for pest recognition. Electronics, 12(12):2620, 2023. 1

work page 2023

[55] [56]

Few-shot agricultural pest recognition based on multimodal masked autoencoder. Crop Protection, 187:106993, 2025

Yinshuo Zhang, Lei Chen, and Yuan Yuan. Few-shot agricultural pest recognition based on multimodal masked autoencoder. Crop Protection, 187:106993, 2025. 1

work page 2025

[56] [57]

Enhancing chain of thought prompting in large language models via reasoning patterns

Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, and Jinqiao Wang. Enhancing chain of thought prompting in large language models via reasoning patterns. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 25985–25993, 2025. 2

work page 2025

[57] [58]

Implementation of large language models and agricultural knowledge graphs for efficient plant disease detection. Agriculture, 14(8):1359, 2024

Xinyan Zhao, Baiyan Chen, Mengxue Ji, Xinyue Wang, Yuhan Yan, Jinming Zhang, Shiyingjie Liu, Muyang Ye, and Chunli Lv. Implementation of large language models and agricultural knowledge graphs for efficient plant disease detection. Agriculture, 14(8):1359, 2024. 1

work page 2024

[58] [59]

Tinyllava: A framework of small-scale large multimodal models

Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024. 7, 3, 5, 6, 8, 9, 10

work page arXiv 2024

[59] [60]

Few-shot image classification of crop diseases based on vision–language models. Sensors, 24(18):6109,

Yueyue Zhou, Hongping Yan, Kun Ding, Tingting Cai, and Yan Zhang. Few-shot image classification of crop diseases based on vision–language models. Sensors, 24(18):6109,

work page

[60] [61]

Percentage of Different Datasets

Supplementary Material. Figure 8. Percentage of Different Datasets. Sources shown: AgroMind, AgroBench, AGMMU, CDDM; total 4,535 QA pairs. Chart labels: 87.8%, 4.4%, 4.0%, 3.9%, 12.2%.

work page

[61] [62]

Data Sources Selection AgriCoT is constructed based on four state-of-the-art agricultural vision-language benchmarks (see Tab

Data Collection. 7.1. Data Sources Selection. AgriCoT is constructed based on four state-of-the-art agricultural vision-language benchmarks (see Tab. 3), each offering unique characteristics and specialized annotations for comprehensive evaluation of VLMs in agriculture. CDDM [26] enhances our dataset with large-scale crop disease coverage, comprising 137...

work page arXiv 2024

[62] [63]

<option> || <reasoning steps>

CoT Construction Details. To ensure that the CoTs generated by GPT-4o exhibit both deep reasoning and a consistent structure, we design a prompt template (see Fig. 10). In addition to the basic inputs (reference image, question, and options), the template explicitly guides the model to clarify the core problem, produce detailed and coherent visual desc...

work page

[63] [64]

answer then reasoning

Experiment Details. 9.1. Evaluation Protocols. We supplement the evaluation with ROUGE-1, ROUGE-2, and ROUGE-L, computing precision, recall, and F1 score for each. ROUGE measures n-gram overlap between generated and reference CoTs, reflecting content coverage and sequence structure. Precision indicates how well the model captures relevant information, whi...
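The ROUGE variants named in this excerpt can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes lowercased whitespace tokenization with no stemming, and the function names (`rouge_n`, `rouge_l`, `lcs_length`) are hypothetical. Published ROUGE toolkits typically apply their own tokenization and stemming, so absolute scores may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Precision, recall, and F1 from an overlap count and the two totals."""
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(candidate, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between candidate and reference text."""
    # Assumption: simple lowercase + whitespace tokenization.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: precision, recall, and F1 based on the LCS length."""
    c = candidate.lower().split()
    r = reference.lower().split()
    return prf(lcs_length(c, r), len(c), len(r))


if __name__ == "__main__":
    generated = "the tree has no leaves"
    gold = "the tree has bare branches and no leaves"
    print("ROUGE-1:", rouge_n(generated, gold, 1))
    print("ROUGE-2:", rouge_n(generated, gold, 2))
    print("ROUGE-L:", rouge_l(generated, gold))
```

In this sketch, precision divides the overlap by the length of the generated CoT and recall divides it by the length of the reference CoT, so a short generated chain that only reuses reference words scores high precision but lower recall.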

work page arXiv 2011

[64] [65]

Keyword extraction capability The ROUGE-1 scores presented in Tab

More Experimental Results. 10.1. Keyword extraction capability. The ROUGE-1 scores presented in Tab. 5 provide insights into the reasoning capabilities of the evaluated models across various tasks. ROUGE-1, which measures the overlap of unigrams between the generated CoT and the reference, reflects the model’s ability to generate keywords in reasoning t...

work page

[65] [66]

Further Discussion. Does the reasoning ability of VLMs improve as the parameters increase? As shown in Fig. 11 (a) and (b), except for InternVL2&3 and Qwen2.5-VL, whose reasoning capabilities improve with an increase in parameters when the number of parameters is small, the reasoning capabilities of other models remain relatively stable once the num- ...

work page

[66] [67]

As model capacities continue to grow, larger and more diverse training data become essential to fully exploit their reasoning potential

Limitations and Future Work. While AgriCoT has been carefully curated and manually refined to address agricultural VQA tasks, the current dataset scale remains insufficient given the rapid evolution of modern VLMs. As model capacities continue to grow, larger and more diverse training data become essential to fully exploit their reasoning potential. ...

work page

[67] [68]

AgriCoT will also be released as open-source to facilitate academic research

Ethical Considerations. All data in this dataset are sourced from publicly available open-source datasets, and their use adheres to the respective copyright and licensing terms. AgriCoT will also be released as open-source to facilitate academic research. Seven annotators complete training covering agricultural knowledge, annotation protocols, and ethica...

work page arXiv

[68] [69]

The question asks about the type of damage observed on the plant in the image, and provides four options related to different types of plant stress or damage. 2. Observing the image, the tree in the center appears to have no leaves, while the surrounding vegetation is green and healthy, indicating that the issue is specific to this tree and not a general ...

work page

[69] [70]

The plant in the image appears to have no leaves, which suggests it has experienced some form of damage or stress. 2. The branches are bare, indicating that the damage occurred after the leaves had already grown, which is more consistent with a late-season event. 3. The absence of visible insect damage or signs of bacterial infection (such as leaf spots o...

work page

[70] [71]

The question asks to identify the type of damage observed on the plant in the image, providing four options: late season freeze-hail-freeze damage, insect infestation damage, bacterial leaf scorch, and early summer drought stress

work page

[71] [72]

Observing the image, the tree appears to have sparse foliage, with branches that are mostly bare and no visible signs of leaf scorch, insect activity, or drought-related wilting

work page

[72] [73]

Late season freeze-hail-freeze damage is consistent with the observed symptoms, as freezing temperatures can cause buds and leaves to fail to develop properly, leaving the tree looking bare and stressed

work page

[73] [74]

The other options (insect infestation, bacterial leaf scorch, and drought stress) typically show distinct signs such as holes in leaves, browning or scorching of leaf edges, or drooping foliage, none of which are visible in the image

work page

[74] [75]

Plant Management Figure 13

Based on the analysis, the most appropriate option is late season freeze-hail-freeze damage. Plant Management Figure 13. A case of the Plant Management task with responses from GPT-4.1 and InternVL3-38B. Question: What are the distinctive features of the plant shown in the image? Options: A: pine tree with needle-like leaves and brown cones B: cactus with...

work page

[75] [76]

The question asks for the distinctive features of the plant shown in the image, and provides four options describing different types of plants

work page

[76] [77]

The fronds are arching and some are turning brown, while others are still green and vibrant

Observing the image, the plant has fronds that are divided into leaflets, which is characteristic of ferns. The fronds are arching and some are turning brown, while others are still green and vibrant. 3. Option A describes a pine tree, which has needle-like leaves and cones, not matching the image. Option B describes a cactus, which has thick, spiny stems...

work page

[77] [78]

ostrich fern with arching habit and tough, wiry black rhizomes

The image shows a cluster of ferns with large, arching fronds. 2. The fronds have a distinct, feathery appearance with a central stem and leaflets arranged on either side. 3. Some fronds are green, while others have a brownish hue, indicating different stages of growth or decay. 4. The description of "ostrich fern with arching habit and tough, wiry black ...

work page

[78] [79]

The question asks to identify the distinctive features of the plant shown in the image and match it with the most appropriate option

work page

[79] [80]

The image depicts a plant with arching fronds, characteristic of ferns, and some parts of the plant show wiry, dark structures resembling rhizomes

work page

[80] [81]

Option A describes a pine tree with needle-like leaves and brown cones, which does not match the plant in the image as the plant lacks needle-like leaves or cones

work page