CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

Dian Yang; Hanlei Li; Jie Leng; Mingyu Zhang; Qingyue Zhang; Shao-Lun Huang; Tengda Guo; Yaoyuan Liang; Yuhua Fu

arxiv: 2605.16416 · v1 · pith:ZBV5YFPVnew · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Tengda Guo, Jie Leng, Hanlei Li, Yaoyuan Liang, Qingyue Zhang, Dian Yang, Mingyu Zhang, Yuhua Fu, Shao-Lun Huang

Pith reviewed 2026-05-20 21:07 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords fragmented visual reasoning · credit assignment · vision-language models · process reward · GRPO · TRACER-Bench · multimodal reasoning · visual evidence integration

The pith

CAVE assigns credit to each visual reasoning step using three process signals so models learn to combine scattered image evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models struggle when relevant visual clues sit in different parts of an image and the question leaves the right combination underdetermined. The paper proposes CAVE to solve this by scoring every intermediate action according to how much it updates the model's belief, gathers new evidence, and adjusts focus. These scores then drive optimization through GRPO so the model learns more reliable strategies for linking nonlocal information. The approach yields clear gains on both existing benchmarks and the authors' new TRACER-Bench, which supplies intermediate evidence labels across four confusable reasoning dimensions, while general multimodal performance remains competitive. Analyses further indicate the method strengthens robustness when dependencies span longer distances or deeper cross-region links.

Core claim

CAVE is a structured process-reward method based on GRPO for interleaved visual reasoning. It evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Experiments show that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and the newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE also en

What carries the argument

Credit Assignment for Visual Evidence (CAVE), a GRPO-based process-reward method that scores each reasoning action with three signals—belief update, evidence acquisition, and adaptive focus control—to optimize visual reasoning paths.
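
As a rough illustration of how a process reward can feed a GRPO-style update, the sketch below mixes a final-answer reward with a process credit and normalizes the result within a group of rollouts for the same prompt. The mixing weight `beta`, the helper names, and the toy numbers are assumptions made for this example; they are not taken from the paper.

```python
from statistics import mean, pstdev


def total_reward(outcome, process, beta=0.5):
    """Hypothetical mix of a final-answer reward and a process credit."""
    return outcome + beta * process


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std is a choice made here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one (image, question) pair, invented for illustration:
# (outcome reward, process credit)
rollouts = [(1.0, 0.8), (1.0, 0.2), (0.0, 0.6), (0.0, 0.1)]
rewards = [total_reward(o, p) for o, p in rollouts]
advantages = group_relative_advantages(rewards)

for (o, p), adv in zip(rollouts, advantages):
    print(f"outcome={o:.1f} process={p:.1f} advantage={adv:+.3f}")
```

Under this setup, two rollouts with the same final answer can still receive different advantages when their process credits differ, which is the effect a structured process reward is meant to produce.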

If this is right

Substantially improves performance on tasks that require integrating fragmented visual evidence.
Delivers gains on both existing public benchmarks and the new TRACER-Bench.
Maintains competitive results on general multimodal evaluations.
Increases robustness when visual dependencies span longer ranges or deeper cross-region connections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-signal credit structure could be tested on reasoning chains that cross text, audio, and image modalities at once.
TRACER-Bench's provision of intermediate evidence labels suggests future benchmarks should supervise full reasoning paths rather than final answers alone.
Applying CAVE-style scoring to longer-horizon visual search tasks might expose scaling limits in current VLM attention mechanisms.

Load-bearing premise

The three complementary reasoning process signals accurately measure the contribution of each intermediate action and can be used to reliably optimize visual reasoning strategies via GRPO.

What would settle it

Ablation results on TRACER-Bench in which removing any one of the three signals (belief update, evidence acquisition, or adaptive focus control) produces no improvement over a plain GRPO baseline on nonlocal reasoning tasks.

Figures

Figures reproduced from arXiv: 2605.16416 by Dian Yang, Hanlei Li, Jie Leng, Mingyu Zhang, Qingyue Zhang, Shao-Lun Huang, Tengda Guo, Yaoyuan Liang, Yuhua Fu.

**Figure 2.** Overview of CAVE. Given an input pair, the policy performs fragmented visual evidence reasoning. CAVE assigns structured process credits to belief update, evidence acquisition, and adaptive focus control, and is optimized with GRPO. atf denotes the teacher-forced action under the current context state, obtained outside the rollout and used only for reward computation.

**Figure 3.** Scenario distribution of TRACER-Bench.

**Figure 4.** Process-credit validation and difficulty analysis. (a) Distribution of correct and wrong …

**Figure 5.** Case study of rollout trajectories before CAVE training.

**Figure 6.** Case study of rollout trajectories after CAVE training.

**Figure 7.** Representative example of Rule-Switching Navigation.

**Figure 8.** Representative example of Nonsemantic Tracing.

**Figure 9.** Representative example of Embedded Matching.

**Figure 10.** Representative example of Remote-Sensing Subimage Matching.

**Figure 11.** Composition of the training corpus used for …

**Figure 12.** Representative Tunnel-Vision-style jumping example in the training corpus. The sample …

**Figure 13.** Training dynamics of CAVE process rewards. The total structured reward and its three …

**Figure 14.** KL dynamics during CAVE training. The KL loss increases in the early stage and then …

**Figure 15.** Teacher-forced answer and evidence gains.

Original abstract

Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CAVE adds a structured process-reward method with three signals and a new benchmark for fragmented visual reasoning, but the signals lack shown validation against actual causal impact.

The main thing to know is that this paper proposes CAVE, a GRPO-based method that breaks credit assignment into three signals (belief update, evidence acquisition, and adaptive focus control) to help VLMs handle nonlocal visual evidence, together with a new benchmark, TRACER-Bench, with intermediate labels across four confusable dimensions. It frames the problem as Fragmented Visual Reasoning and claims better performance on those tasks without hurting general multimodal results, along with some robustness gains on longer dependencies. That combination of targeted signals and a purpose-built dataset is the clearest new element.

The benchmark in particular looks useful because it supplies the kind of intermediate evidence that standard datasets often omit, which could let others measure reasoning paths more directly. The paper also keeps the scope reasonable by showing the method does not degrade on unrelated evaluations.

The soft spot is exactly the one the stress-test note flags. The abstract gives no details on how the three signals are computed or whether they correlate with ground-truth contributions to the final answer. Without that check, it is hard to know whether the structured reward is doing real credit assignment or simply adding overhead to a generic RL loop. If the signals are heuristic or redundant, the reported improvements could shrink once proper ablations appear.

This is worth a referee because the problem is real, the benchmark adds something concrete, and the method is specific enough to test. Readers working on process supervision or visual reasoning benchmarks would get the most out of it. I would send it to peer review with a request to strengthen the signal validation and add controls.

Referee Report

3 major / 2 minor

Summary. The paper introduces CAVE, a structured process-reward approach based on GRPO for improving Vision-Language Models on fragmented visual reasoning tasks that require integrating nonlocal visual evidence. It defines three complementary signals—belief update, evidence acquisition, and adaptive focus control—to assign credit at the action level during interleaved reasoning. The work also presents TRACER-Bench, a new benchmark covering four nonlocal and semantically confusable reasoning dimensions with intermediate evidence annotations. Experiments claim substantial gains on fragmented visual reasoning benchmarks and TRACER-Bench while preserving competitive results on general multimodal evaluations, with additional analyses on robustness to longer-range dependencies.

Significance. If the three process signals can be shown to faithfully quantify per-action contributions to final correctness, CAVE would offer a targeted credit-assignment mechanism that addresses a recognized weakness in current VLM reasoning pipelines. This could meaningfully advance reliable multimodal integration of fragmented evidence and provide a reproducible template for process-level rewards in visual reasoning.

major comments (3)

[Method (process-reward formulation)] The central claim that the three signals (belief update, evidence acquisition, adaptive focus control) provide reliable per-action credit assignment rests on an unverified assumption. The manuscript provides no direct validation—such as correlation analysis or ablation against ground-truth causal impact on answer correctness—particularly for nonlocal evidence cases; without this, the reported gains could arise from generic RL effects rather than structured credit assignment.
[Experiments and TRACER-Bench description] TRACER-Bench is presented as supplying key intermediate evidence to supervise reasoning paths, yet the evaluation does not report whether the learned policies actually utilize these annotations for credit assignment or merely for final-answer supervision. This leaves open whether the benchmark truly tests the proposed mechanism.
[Further analyses] The robustness analysis under longer-range and deeper cross-region dependencies is described qualitatively; quantitative breakdowns (e.g., performance stratified by dependency length or number of regions) are needed to substantiate the claim that CAVE specifically improves handling of fragmented evidence.

minor comments (2)

[Method] Notation for the three signals should be formalized with explicit equations showing how each is computed from model internals or outputs.
[Introduction] The abstract and introduction would benefit from a concise comparison table contrasting CAVE with prior process-reward or outcome-reward baselines in the VLM literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, clarifying our current approach where possible and outlining planned revisions to address the concerns raised.

Referee: [Method (process-reward formulation)] The central claim that the three signals (belief update, evidence acquisition, adaptive focus control) provide reliable per-action credit assignment rests on an unverified assumption. The manuscript provides no direct validation—such as correlation analysis or ablation against ground-truth causal impact on answer correctness—particularly for nonlocal evidence cases; without this, the reported gains could arise from generic RL effects rather than structured credit assignment.

Authors: We acknowledge that the manuscript does not include explicit correlation analysis between the per-action process rewards and ground-truth causal impact on final correctness. Our ablations demonstrate that ablating individual signals reduces performance specifically on fragmented reasoning tasks while preserving general capabilities, which provides indirect support for the structured nature of the credit assignment. However, to more directly address the concern and distinguish from generic RL effects, we will add a new subsection with correlation analysis and causal impact ablations, stratified by nonlocal evidence cases.

revision: yes

Referee: [Experiments and TRACER-Bench description] TRACER-Bench is presented as supplying key intermediate evidence to supervise reasoning paths, yet the evaluation does not report whether the learned policies actually utilize these annotations for credit assignment or merely for final-answer supervision. This leaves open whether the benchmark truly tests the proposed mechanism.

Authors: The intermediate evidence annotations in TRACER-Bench are used to compute the three process-reward signals (belief update, evidence acquisition, and adaptive focus control) that guide action-level credit assignment during GRPO training. The final-answer supervision is used only for the outcome reward. We did not report an explicit breakdown of reliance on annotations versus final-answer signals. We will revise the experimental section to include this clarification along with an ablation showing performance when intermediate annotations are withheld from the process rewards.

revision: yes

Referee: [Further analyses] The robustness analysis under longer-range and deeper cross-region dependencies is described qualitatively; quantitative breakdowns (e.g., performance stratified by dependency length or number of regions) are needed to substantiate the claim that CAVE specifically improves handling of fragmented evidence.

Authors: We agree that quantitative stratification would strengthen the robustness claims. The current manuscript reports aggregate improvements and qualitative observations on longer-range dependencies. We will add tables and figures breaking down performance by dependency length and number of regions on both TRACER-Bench and public benchmarks to provide the requested quantitative evidence.

revision: yes

Circularity Check

0 steps flagged

No circularity: CAVE's process signals and GRPO optimization form an independent credit-assignment method

The paper defines CAVE as a structured process-reward approach that computes three explicit signals (belief update, evidence acquisition, adaptive focus control) to assign credit at the action level and then applies GRPO. These signals are presented as complementary heuristics derived from reasoning process internals rather than fitted to final-answer correctness or defined in terms of the optimization target. TRACER-Bench supplies external intermediate supervision and public benchmarks provide independent evaluation. No equations or self-citations reduce the claimed performance gain to a tautology or to parameters fitted on the same target metric. The derivation chain therefore remains non-circular and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; full paper would be required to audit modeling choices such as reward weighting or GRPO hyperparameters.

pith-pipeline@v0.9.0 · 5745 in / 1130 out tokens · 50979 ms · 2026-05-20T21:07:40.489365+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $R_{\mathrm{CAVE}}(\tau) = \lambda_{bu} C_{bu}(\tau) + \lambda_{ea} C_{ea}(\tau) + \lambda_{af} C_{af}(\tau)$ with teacher-forced log-likelihood gains
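
The weighted sum in the passage quoted above can be sketched in a few lines. The page does not say how each credit term is computed, so the example below makes a loud assumption: the belief-update credit is taken to be the sum of per-step gains in the teacher-forced log-likelihood of the gold answer. That reading, the function names, and all numbers are hypothetical.

```python
def log_likelihood_gains(logps):
    """Per-step gains from a sequence of log-likelihoods.

    logps[t] is assumed to be the teacher-forced log-likelihood of a target
    (for example the gold answer) given the context after t reasoning actions,
    with logps[0] measured before any action.
    """
    return [after - before for before, after in zip(logps, logps[1:])]


def cave_reward(c_bu, c_ea, c_af, lam_bu=1.0, lam_ea=1.0, lam_af=1.0):
    """Weighted sum of the three trajectory-level credits."""
    return lam_bu * c_bu + lam_ea * c_ea + lam_af * c_af


# Toy log-likelihoods of the gold answer after 0, 1, 2, 3 actions (invented).
answer_logps = [-2.0, -1.5, -1.6, -0.5]
gains = log_likelihood_gains(answer_logps)

# ASSUMPTION: belief-update credit is the sum of per-step gains.
c_bu = sum(gains)

# Evidence-acquisition and focus-control credits are placeholders here.
reward = cave_reward(c_bu, c_ea=0.5, c_af=0.25, lam_bu=1.0, lam_ea=0.5, lam_af=0.5)
print(gains, reward)
```

In this toy trajectory the second action lowers the answer likelihood and so receives a negative gain, which shows how a per-step signal can penalize an unhelpful action even when the final answer turns out to be correct.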

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 7 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tianmin Han, Zhitao Gong, Sina Samangooei, William Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, et al. Flamingo: a visual language model for few-shot l...

work page 2022
[2]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning, 2023

work page 2023
[3]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023

work page 2023
[4]

Qwen2.5-VL technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, et al. Qwen2.5-VL technical report, 2025

work page 2025
[5]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[6]

Mul- timodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Mul- timodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024

work page 2024
[7]

VLMs have tunnel vision: Evaluating nonlocal visual reasoning in leading VLMs

Shmuel Berman and Jia Deng. VLMs have tunnel vision: Evaluating nonlocal visual reasoning in leading VLMs. InAdvances in Neural Information Processing Systems, 2025. URL https: //vlmtunnel.github.io/. Spotlight presentation

work page 2025
[8]

Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y . Charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, and Kuan Li. ...

work page 2026
[9]

Smith, and Ranjay Krishna

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[10]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, and Yu Cheng. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

thinking with images

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing “thinking with images” via reinforcement learning, 2025

work page 2025
[12]

Chain-of-Focus: Adaptive visual search and zooming for multimodal reasoning via RL, 2025

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. Chain-of-Focus: Adaptive visual search and zooming for multimodal reasoning via RL, 2025

work page 2025
[13]

VisualPRM: An effective process reward model for multimodal reasoning, 2025

Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, and Wenhai Wang. VisualPRM: An effective process reward model for multimodal reasoning, 2025

work page 2025
[14]

Information gain-based policy optimization: A simple and effective ap- proach for multi-turn search agents

Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, and Zhenzhe Ying. Information gain-based policy optimization: A simple and effective ap- proach for multi-turn search agents. InThe F ourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=qkWP6phrvZ

work page 2026
[15]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Yang Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. 10

work page 2024
[16]

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[17]

LLaV A-OneVision: Easy visual task transfer, 2024

Bo Li, Kaichen Zhang, Biao Zhang, et al. LLaV A-OneVision: Easy visual task transfer, 2024

work page 2024
[18]

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan. MME- RealWorld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[19]

BLINK: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, et al. BLINK: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision, 2024

work page 2024
[20]

Hallusion- Bench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, et al. Hallusion- Bench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[21]

DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models

Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[22]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaV A-CoT: Let vision language models reason step-by-step.arXiv preprint arXiv:2411.10440, 2024. ICCV 2025

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Point-RFT: Improving multimodal reasoning with visually grounded reinforcement finetuning, 2025

Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, and Li- juan Wang. Point-RFT: Improving multimodal reasoning with visually grounded reinforcement finetuning, 2025

work page 2025
[24]

Grounded chain-of-thought for multimodal large language models.arXiv preprint arXiv:2503.12799, 2025

Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, and Rongrong Ji. Grounded chain-of-thought for multimodal large language models.arXiv preprint arXiv:2503.12799, 2025

work page arXiv 2025
[25]

Visual programming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[26]

ViperGPT: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl V ondrick. ViperGPT: Visual inference via python execution for reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023
[27]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

VLM-R1: A stable and generalizable R1-style large vision-language model, 2025

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. VLM-R1: A stable and generalizable R1-style large vision-language model, 2025

work page 2025
[29]

R1-VL: Learning to reason with multimodal large language models via step-wise group relative policy optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-VL: Learning to reason with multimodal large language models via step-wise group relative policy optimization. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

work page 2025
[30]

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Open- VLThinker: An early exploration to complex vision-language reasoning via iterative self- improvement.arXiv preprint arXiv:2503.17352, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization

Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, and Yichao Wu. StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization. In Christos Christodoulopou- los, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21805–21830, Suzhou, ...

work page doi:10.18653/v1/2025.emnlp-main.1106 2025
[32]

GiGPO: Group-in-group policy optimiza- tion for LLM agent training, 2025

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. GiGPO: Group-in-group policy optimiza- tion for LLM agent training, 2025

work page 2025
[33]

RealWorldQA: A benchmark for evaluating real-world spatial understanding in multimodal ai models.https://huggingface.co/datasets/xai-org/RealworldQA, 2024

xAI. RealWorldQA: A benchmark for evaluating real-world spatial understanding in multimodal ai models.https://huggingface.co/datasets/xai-org/RealworldQA, 2024

work page 2024
[34]

Are we on the right way for evaluating large vision-language models? In Advances in Neural Information Processing Systems, 2024

Lin Chen et al. Are we on the right way for evaluating large vision-language models? In Advances in Neural Information Processing Systems, 2024

work page 2024
[35]

Introducing GPT-5.2

OpenAI. Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/ ,

work page
[36]

Accessed: 2026-04-26

work page 2026
[37]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, 2026. Accessed: 2026-04-26

work page 2026
[38]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

[42]

SoTA with less: MCTS-guided sample selection for data-efficient visual reasoning self-improvement

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. SoTA with less: MCTS-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025

[43]

SFT or RL? An early investigation into training R1-like reasoning large vision-language models

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. SFT or RL? An early investigation into training R1-like reasoning large vision-language models, 2025

[44]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024

[45]

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, 2024

[46]

HybridFlow: A flexible and efficient RLHF framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, et al. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, 2025.

A Case study on rollout trajectories

Figure 5 and Figure 6 present a comparison of rollout trajectories before and after CAVE training. The ...
