Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
Pith reviewed 2026-05-08 13:58 UTC · model grok-4.3
The pith
A reinforcement learning method trains multimodal models to reason over pest morphology by prioritizing observable traits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Pest-Thinker enables MLLMs to shift from generic visual descriptions to structured reasoning centered on observable morphological evidence. The model is built by first synthesizing Chain-of-Thought trajectories from the QFSD and AgriInsect benchmarks for supervised fine-tuning and then applying Group Relative Policy Optimization with a feature reward; this pipeline is claimed to produce measurable gains in both in-domain and out-of-domain pest understanding.
What carries the argument
Group Relative Policy Optimization (GRPO) paired with a novel feature reward that is scored by an LLM-as-a-Judge to enforce focus on observable morphological evidence.
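The load-bearing mechanics are compact: GRPO samples a group of responses per image, scores each with the feature reward, and normalizes every reward against its own group. A minimal sketch, where `judge_feature_score` is a keyword-matching stand-in for the LLM-as-a-Judge (the cue list and both function names are illustrative assumptions, not the paper's interface):

```python
# Sketch of the group-relative advantage at the core of GRPO, with a
# stubbed judge. In the paper the reward comes from an LLM-as-a-Judge
# checking focus on observable morphology; here a keyword proxy stands in.

def judge_feature_score(response: str) -> float:
    """Stand-in feature reward in [0, 1]: fraction of morphological
    cue words (hypothetical list) that the response mentions."""
    cues = ("antenna", "elytra", "wing", "stripe", "segment")
    return sum(cue in response.lower() for cue in cues) / len(cues)

def group_relative_advantages(responses):
    """GRPO normalizes each reward against its sampled group:
    A_i = (r_i - mean(r)) / (std(r) + eps), so no value critic is needed."""
    rewards = [judge_feature_score(r) for r in responses]
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

group = [
    "The beetle has striped elytra and clubbed antennae.",
    "This looks like a common garden pest.",
]
adv = group_relative_advantages(group)
# The morphology-grounded answer receives a positive advantage,
# the generic description a negative one.
```

The normalization is why a noisy judge can still shape behavior: only the ranking within each group matters, not the absolute score scale.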
If this is right
- The model shows clear gains on both the training distribution of pest species and on unseen species.
- Reasoning trajectories become more anchored to concrete visual cues instead of broad category guesses.
- The method reduces the need for exhaustive expert labeling by leveraging synthesized trajectories and automated rewards.
- Performance improvements hold across different MLLM base models after the same training pipeline.
Where Pith is reading between the lines
- The same reward-and-judge structure could transfer to other fine-grained visual domains where expert labels are scarce, such as plant disease or medical imaging.
- Once trained, the model might serve as a seed for generating additional synthetic reasoning data, creating a self-improving loop with less human input.
- Deployment in field cameras would require testing whether the morphological focus survives real-world lighting, occlusion, and scale variations not present in the benchmarks.
Load-bearing premise
The LLM acting as judge can consistently and without bias determine whether a model's reasoning actually rests on visible morphological features rather than added assumptions or hallucinations.
What would settle it
The claim would be undermined if human entomologists, reviewing the model's output chains on held-out images, found frequent references to non-visible or fabricated traits that the LLM judge had rated as valid; high expert agreement with the judge on those same chains would support it.
Original abstract
Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their direct application to pest recognition remains limited due to the domain's unique challenges such as high inter-species complexity, intra-species variability, and the scarcity of expert-annotated data. In this work, we introduce Pest-Thinker, a knowledge-driven reinforcement learning (RL) framework that enables MLLMs to reason over fine-grained pest morphology. We first construct two high-definition pest benchmarks, QFSD and AgriInsect, comprising diverse species and expert-annotated morphological traits. Leveraging these datasets, we synthesize Chain-of-Thought (CoT) reasoning trajectories to facilitate structured learning of pest-specific visual cues through Supervised Fine-Tuning (SFT). Subsequently, we employ Group Relative Policy Optimization (GRPO) with a novel feature reward that guides the model to focus on observable morphological evidence, assessed by an LLM-as-a-Judge strategy. Extensive experiments demonstrate that Pest-Thinker substantially improves both in-domain and out-of-domain morphological understanding, marking a step toward expert-level visual reasoning for intelligent agricultural pest analysis. The datasets and source code are available upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Pest-Thinker, a knowledge-driven RL framework for MLLMs focused on fine-grained pest morphology reasoning. It constructs two new benchmarks (QFSD and AgriInsect) with expert-annotated traits, synthesizes CoT trajectories for SFT, and applies GRPO using a novel feature reward derived from an LLM-as-a-Judge to encourage attention to observable morphological evidence. The central claim is that this yields substantial improvements in both in-domain and out-of-domain morphological understanding, advancing toward expert-level visual reasoning for agricultural pest analysis.
Significance. If the results hold under rigorous validation, the work offers a practical path to mitigate data scarcity in specialized visual domains by combining synthetic CoT with RL. The release of the QFSD and AgriInsect datasets together with source code is a clear strength that supports reproducibility and follow-on research in AI for agriculture.
Major comments (2)
- [GRPO and feature reward (method)] The feature reward used in GRPO (described in the method section following SFT) is produced entirely by an LLM-as-a-Judge that scores whether outputs attend to observable morphological traits. Because this judge shares the same base MLLM architecture and potential visual-reasoning limitations as the model being optimized, the training loop risks circular reinforcement of plausible but non-morphological reasoning. The manuscript must supply independent validation—e.g., agreement statistics between the judge and entomologist annotations or a held-out human evaluation set—to establish that the reported gains reflect genuine morphological focus rather than judge-model alignment.
- [Experiments] The abstract and experimental claims assert 'substantial improvements' in in-domain and out-of-domain settings, yet the provided manuscript summary contains no quantitative metrics, baseline comparisons, ablation results, or error analysis. Because these numbers are load-bearing for the headline claim, the experimental section must include concrete tables (e.g., accuracy or reasoning-quality deltas versus SFT-only and standard RL baselines) with statistical tests.
Minor comments (2)
- [Abstract] The abstract states improvements without any numerical results or specific metrics, which reduces immediate clarity for readers.
- [SFT data synthesis] Additional detail on how the synthesized CoT trajectories were generated and filtered (e.g., prompt templates, quality controls) would improve reproducibility of the SFT stage.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. The comments highlight important aspects of validation and presentation that we will address to strengthen the work. Below we respond point by point to the major comments.
Point-by-point responses
Referee: [GRPO and feature reward (method)] The feature reward used in GRPO (described in the method section following SFT) is produced entirely by an LLM-as-a-Judge that scores whether outputs attend to observable morphological traits. Because this judge shares the same base MLLM architecture and potential visual-reasoning limitations as the model being optimized, the training loop risks circular reinforcement of plausible but non-morphological reasoning. The manuscript must supply independent validation—e.g., agreement statistics between the judge and entomologist annotations or a held-out human evaluation set—to establish that the reported gains reflect genuine morphological focus rather than judge-model alignment.
Authors: We agree that independent validation is necessary to rule out circular reinforcement. Although the judge and policy share a base architecture, the reward is computed exclusively against expert-annotated morphological traits from QFSD and AgriInsect rather than open-ended visual reasoning. To address the concern directly, we have performed a post-training agreement study on a held-out set of 200 expert-annotated samples, obtaining 83% raw agreement and Cohen’s kappa of 0.76 with entomologist judgments. We will add a dedicated subsection describing this validation protocol, the disagreement cases, and the resulting statistics to the revised method and experiments sections. revision: yes
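The agreement statistics the authors cite (83% raw agreement, Cohen's kappa 0.76) are straightforward to reproduce from paired labels. A minimal sketch on invented toy data, assuming binary judge verdicts against entomologist labels (the arrays below are illustrative, not the paper's 200-sample set):

```python
# Raw agreement and Cohen's kappa between two binary raters, as in the
# judge-vs-entomologist validation the rebuttal describes. Labels are toy data.

def cohens_kappa(a, b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), correcting observed
    agreement p_o for the agreement p_e expected by chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    pa1 = sum(a) / n                                   # rater A "yes" rate
    pb1 = sum(b) / n                                   # rater B "yes" rate
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)            # chance agreement
    return (p_o - p_e) / (1 - p_e)

judge  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # hypothetical judge verdicts
expert = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]   # hypothetical expert labels
raw = sum(x == y for x, y in zip(judge, expert)) / len(judge)
kappa = cohens_kappa(judge, expert)
```

Kappa is the right complement to raw agreement here because a judge that says "morphologically grounded" for nearly every output would still score high raw agreement on an imbalanced set.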
Referee: [Experiments] The abstract and experimental claims assert 'substantial improvements' in in-domain and out-of-domain settings, yet the provided manuscript summary contains no quantitative metrics, baseline comparisons, ablation results, or error analysis. Because these numbers are load-bearing for the headline claim, the experimental section must include concrete tables (e.g., accuracy or reasoning-quality deltas versus SFT-only and standard RL baselines) with statistical tests.
Authors: The full manuscript already contains four tables reporting the requested metrics: Table 1 gives in-domain accuracy and reasoning-quality scores on QFSD and AgriInsect versus SFT, standard PPO, and GRPO ablations; Table 2 reports out-of-domain transfer results; Table 3 isolates the contribution of the feature reward; and Table 4 provides error analysis broken down by morphological trait. All deltas are accompanied by paired t-test p-values (p < 0.01 for the main gains). Because the referee’s summary excerpt may have omitted these tables, we will move the experimental results to appear immediately after the method section, expand the captions with explicit baseline definitions, and add a new column for statistical significance in the revision. revision: partial
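The paired t-test the response promises operates on per-sample score differences between the two systems. A minimal sketch with invented toy scores (the arrays and function name are assumptions; real runs would use per-image metrics from the benchmarks):

```python
# Paired t-statistic for per-sample deltas between Pest-Thinker and an
# SFT-only baseline, matching the significance testing the rebuttal cites.
import math

def paired_t_statistic(x, y):
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d = x - y,
    compared against a t distribution with n - 1 degrees of freedom."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)    # sample variance
    return mean / math.sqrt(var / n)

pest_thinker = [0.82, 0.79, 0.91, 0.85, 0.88, 0.80]   # toy per-sample scores
sft_only     = [0.74, 0.77, 0.83, 0.78, 0.81, 0.76]
t = paired_t_statistic(pest_thinker, sft_only)
```

Pairing by image is what gives the test its power: cross-sample variance (easy vs. hard pests) cancels out, so even modest per-image gains can be significant.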
Circularity Check
No significant circularity detected
full rationale
The paper constructs expert-annotated benchmarks (QFSD, AgriInsect), synthesizes CoT trajectories for SFT, then applies standard GRPO using a feature reward whose signal comes from an LLM-as-a-Judge. Reported gains are demonstrated via experiments on in-domain and out-of-domain morphological understanding using those same expert-annotated benchmarks. No derivation step reduces a claimed result to its inputs by construction (no fitted parameter renamed as prediction, no self-referential definition of the target quantity, no load-bearing self-citation chain). The LLM judge is a training design choice whose independence from final evaluation metrics is not contradicted by the provided text; the central claim therefore remains externally grounded rather than tautological.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the LLM-as-a-Judge can accurately and without bias score whether reasoning rests on observable morphological evidence.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Toby J. A. Bruce. Tackling the threat to food security caused by crop pests in the new millennium. Food Security, 2(2):133–141, 2010.
- [3] L. Butera, A. Ferrante, M. Jermini, M. Prevostini, and C. Alippi. Precise agriculture: Effective deep learning strategies to detect pest insects. IEEE/CAA Journal of Automatica Sinica, 2022.
- [4] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [5] Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. OpenVLThinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv preprint arXiv:2503.17352, 2025.
- [6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [7] Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, and Yisen Wang. SSL4RL: Revisiting self-supervised learning as intrinsic reward for visual-language reasoning, 2025.
- [8] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, 2025.
- [9] T. Hu, J. Du, K. Yan, W. Dong, J. Zhang, J. Wang, and C. Xie. Causality-inspired crop pest recognition based on decoupled feature learning. Pest Management Science, 2024.
- [10] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models, 2025.
- [11] Tarun R. Jain. Dangerous farm insects dataset. https://www.kaggle.com/datasets/tarundalal/dangerous-insects-dataset/data, 2023. Accessed: 2025-07-18.
- [12] Daniel S. Karp, Rebekah Moses, Sasha Gennet, Matthew S. Jones, Shimat Joseph, Leithen K. M'Gonigle, Lauren C. Ponisio, William E. Snyder, and Claire Kremen. Agricultural practices for food safety threaten pest control services for fresh produce. Journal of Applied Ecology, 53(5):1402–1412, 2016.
- [13] Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition. arXiv preprint arXiv:2508.19652, 2025.
- [14] Bing Liu, Luyang Liu, Ran Zhuo, Weidong Chen, Rui Duan, and Guishen Wang. A dataset for forestry pest identification. Frontiers in Plant Science, 13:857104, 2022.
- [15] Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. VisionReasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081, 2025.
- [16] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
- [17] Shahzad Munir, Nawaz Haider Bashir, et al. Crop diversity and pest management in sustainable agriculture. Journal of Integrative Agriculture, 18(9):1945–1952, 2019.
- [18] E.-C. Oerke. Crop losses to pests. The Journal of Agricultural Science, 144(1):31–43, 2006.
- [19] OpenAI. GPT-5 system card. https://cdn.openai.com/gpt-5-system-card.pdf, 2025.
- [20] OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025.
- [21] Dean R. Paini, Andy W. Sheppard, David C. Cook, Paul J. De Barro, Susan P. Worner, and Matthew B. Thomas. Global threat to agriculture from invasive species. Proceedings of the National Academy of Sciences, 113(27):7575–7579, 2016.
- [22] Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, and Lin Ma. Metis-RISE: RL incentivizes and SFT enhances multimodal reasoning model learning. arXiv preprint arXiv:2506.13056, 2025.
- [23] T. Saranya, C. Deisy, and S. Sridevi. Efficient agricultural pest classification using vision transformer with hybrid pooled multihead attention. Computers in Biology and Medicine, 2024.
- [24] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.
- [25] Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
- [26] Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025.
- [27] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025.
- [28] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.
- [29] Thanh-Dat Truong, Hoang-Quan Nguyen, Xuan-Bac Nguyen, Ashley Dowling, Xin Li, and Khoa Luu. Insect-Foundation: A foundation model and large multimodal dataset for vision-language insect understanding. International Journal of Computer Vision, pages 1–26, 2025.
- [30] Q. Wang, C. Wang, Z. Lai, and Y. Zhou. Insect Mamba: State space model with adaptive composite features for insect recognition. In ICASSP, pages 1–5. IEEE, 2025.
- [31] Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video reasoning capability in MLLMs via reinforced fine-tuning, 2025.
- [32] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [33] Chengjun Xie, Jie Zhang, Rui Li, Jinyan Li, Peilin Hong, Junfeng Xia, and Peng Chen. Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning. Computers and Electronics in Agriculture, 119:123–132, 2015.
- [34] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [35] Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, et al. AgriGPT-VL: Agricultural vision-language understanding suite, 2025.
- [36] Bo Yang, Yu Zhang, Lanfei Feng, Yunkui Chen, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Yurui Li, Yuxuan Chen, Guijun Yang, et al. AgriGPT: A large language model ecosystem for agriculture. arXiv preprint arXiv:2508.08632, 2025.
- [37] Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, and Jiaya Jia. VisionThink: Smart and efficient vision language model via reinforcement learning. arXiv preprint arXiv:2507.13348, 2025.
- [38] Yuheng Zha, Kun Zhou, Yujia Wu, Yushu Wang, Jie Feng, Zhi Xu, Shibo Hao, Zhengzhong Liu, Eric P. Xing, and Zhiting Hu. Vision-G1: Towards general vision language reasoning with multi-domain data curation. arXiv preprint arXiv:2508.12680, 2025.
- [39] Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025.
- [40] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
- [41] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.