CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning
Pith reviewed 2026-05-20 21:07 UTC · model grok-4.3
The pith
CAVE assigns credit to each visual reasoning step using three process signals so models learn to combine scattered image evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAVE is a structured process-reward method based on GRPO for interleaved visual reasoning. It evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Experiments show that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and the newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE also en
What carries the argument
Credit Assignment for Visual Evidence (CAVE), a GRPO-based process-reward method that scores each reasoning action with three signals—belief update, evidence acquisition, and adaptive focus control—to optimize visual reasoning paths.
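The mechanism described above can be illustrated with a minimal sketch. It assumes each rollout already has one score per signal, combines them as a weighted sum, and then applies the group-relative normalization used by GRPO. The weight values, the dictionary keys, and the function names are hypothetical. The source does not state the weights, how the signals are computed, or how this reward is mixed with the outcome reward, so this is not the authors' implementation.

```python
from statistics import mean, pstdev

# Hypothetical weights: the source does not give their values.
WEIGHTS = {"belief_update": 1.0, "evidence_acquisition": 1.0, "adaptive_focus": 1.0}


def process_reward(signals, weights=WEIGHTS):
    """Weighted sum of the three signal scores for one rollout."""
    return sum(weights[name] * signals[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three rollouts for the same question (made-up scores).
group = [
    {"belief_update": 0.6, "evidence_acquisition": 0.8, "adaptive_focus": 0.4},
    {"belief_update": 0.1, "evidence_acquisition": 0.2, "adaptive_focus": 0.3},
    {"belief_update": 0.3, "evidence_acquisition": 0.5, "adaptive_focus": 0.4},
]
rewards = [process_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Rollouts whose combined signal score exceeds the group mean receive a positive advantage and the rest a negative one, which is the sense in which the three signals steer the policy update.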
If this is right
- Substantially improves performance on tasks that require integrating fragmented visual evidence.
- Delivers gains on both existing public benchmarks and the new TRACER-Bench.
- Maintains competitive results on general multimodal evaluations.
- Increases robustness when visual dependencies span longer ranges or deeper cross-region connections.
Where Pith is reading between the lines
- The same three-signal credit structure could be tested on reasoning chains that cross text, audio, and image modalities at once.
- TRACER-Bench's provision of intermediate evidence labels suggests future benchmarks should supervise full reasoning paths rather than final answers alone.
- Applying CAVE-style scoring to longer-horizon visual search tasks might expose scaling limits in current VLM attention mechanisms.
Load-bearing premise
The three complementary reasoning process signals accurately measure the contribution of each intermediate action and can be used to reliably optimize visual reasoning strategies via GRPO.
What would settle it
Ablation results on TRACER-Bench in which removing any one of the three signals (belief update, evidence acquisition, or adaptive focus control) produces no improvement over a plain GRPO baseline on nonlocal reasoning tasks.
Original abstract
Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CAVE, a structured process-reward approach based on GRPO for improving Vision-Language Models on fragmented visual reasoning tasks that require integrating nonlocal visual evidence. It defines three complementary signals—belief update, evidence acquisition, and adaptive focus control—to assign credit at the action level during interleaved reasoning. The work also presents TRACER-Bench, a new benchmark covering four nonlocal and semantically confusable reasoning dimensions with intermediate evidence annotations. Experiments claim substantial gains on fragmented visual reasoning benchmarks and TRACER-Bench while preserving competitive results on general multimodal evaluations, with additional analyses on robustness to longer-range dependencies.
Significance. If the three process signals can be shown to faithfully quantify per-action contributions to final correctness, CAVE would offer a targeted credit-assignment mechanism that addresses a recognized weakness in current VLM reasoning pipelines. This could meaningfully advance reliable multimodal integration of fragmented evidence and provide a reproducible template for process-level rewards in visual reasoning.
major comments (3)
- [Method (process-reward formulation)] The central claim that the three signals (belief update, evidence acquisition, adaptive focus control) provide reliable per-action credit assignment rests on an unverified assumption. The manuscript provides no direct validation—such as correlation analysis or ablation against ground-truth causal impact on answer correctness—particularly for nonlocal evidence cases; without this, the reported gains could arise from generic RL effects rather than structured credit assignment.
- [Experiments and TRACER-Bench description] TRACER-Bench is presented as supplying key intermediate evidence to supervise reasoning paths, yet the evaluation does not report whether the learned policies actually utilize these annotations for credit assignment or merely for final-answer supervision. This leaves open whether the benchmark truly tests the proposed mechanism.
- [Further analyses] The robustness analysis under longer-range and deeper cross-region dependencies is described qualitatively; quantitative breakdowns (e.g., performance stratified by dependency length or number of regions) are needed to substantiate the claim that CAVE specifically improves handling of fragmented evidence.
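The correlation analysis requested in the first major comment could look like the sketch below. It computes a Spearman rank correlation between per-action process rewards and a per-action causal-impact score. The causal-impact measure (for example, the drop in answer correctness when a step is removed) and all numbers are assumptions made for illustration, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up per-action values for a single trajectory.
step_rewards = [0.9, 0.1, 0.5, 0.7, 0.2]
causal_impact = [0.8, 0.0, 0.3, 0.6, 0.1]
rho = spearman(step_rewards, causal_impact)
```

A rank correlation is used here because only the ordering of actions matters for credit assignment. A high value on nonlocal cases would support the claim that the signals track real contribution, while a value near zero would favor the generic-RL explanation.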
minor comments (2)
- [Method] Notation for the three signals should be formalized with explicit equations showing how each is computed from model internals or outputs.
- [Introduction] The abstract and introduction would benefit from a concise comparison table contrasting CAVE with prior process-reward or outcome-reward baselines in the VLM literature.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, clarifying our current approach where possible and outlining planned revisions to address the concerns raised.
Point-by-point responses
- Referee: [Method (process-reward formulation)] The central claim that the three signals (belief update, evidence acquisition, adaptive focus control) provide reliable per-action credit assignment rests on an unverified assumption. The manuscript provides no direct validation, such as correlation analysis or ablation against ground-truth causal impact on answer correctness, particularly for nonlocal evidence cases; without this, the reported gains could arise from generic RL effects rather than structured credit assignment.
  Authors: We acknowledge that the manuscript does not include an explicit correlation analysis between the per-action process rewards and the ground-truth causal impact on final correctness. Our ablations show that removing individual signals reduces performance specifically on fragmented reasoning tasks while preserving general capabilities, which indirectly supports the structured nature of the credit assignment. To address the concern more directly and to distinguish our gains from generic RL effects, we will add a new subsection with correlation analysis and causal-impact ablations, stratified by nonlocal evidence cases.
  Revision: yes
- Referee: [Experiments and TRACER-Bench description] TRACER-Bench is presented as supplying key intermediate evidence to supervise reasoning paths, yet the evaluation does not report whether the learned policies actually utilize these annotations for credit assignment or merely for final-answer supervision. This leaves open whether the benchmark truly tests the proposed mechanism.
  Authors: The intermediate evidence annotations in TRACER-Bench are used to compute the three process-reward signals (belief update, evidence acquisition, and adaptive focus control) that guide action-level credit assignment during GRPO training. The final-answer supervision is used only for the outcome reward. We did not report an explicit breakdown of reliance on annotations versus final-answer signals. We will revise the experimental section to include this clarification, along with an ablation showing performance when intermediate annotations are withheld from the process rewards.
  Revision: yes
- Referee: [Further analyses] The robustness analysis under longer-range and deeper cross-region dependencies is described qualitatively; quantitative breakdowns (e.g., performance stratified by dependency length or number of regions) are needed to substantiate the claim that CAVE specifically improves handling of fragmented evidence.
  Authors: We agree that quantitative stratification would strengthen the robustness claims. The current manuscript reports aggregate improvements and qualitative observations on longer-range dependencies. We will add tables and figures breaking down performance by dependency length and number of regions on both TRACER-Bench and public benchmarks to provide the requested quantitative evidence.
  Revision: yes
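The stratified breakdown requested by the referee could be tabulated along the lines of the sketch below. It groups evaluation records by a dependency attribute and reports accuracy and sample count per bucket. The record fields (`num_regions`, `correct`) and the data are hypothetical, since the source does not describe the benchmark's annotation schema.

```python
from collections import defaultdict


def stratified_accuracy(records, key):
    """Return {bucket: (accuracy, count)}, grouping records by rec[key]."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [num_correct, num_total]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1
    return {b: (c / n, n) for b, (c, n) in sorted(totals.items())}


# Made-up evaluation records; "num_regions" is an assumed field name.
records = [
    {"num_regions": 2, "correct": True},
    {"num_regions": 2, "correct": True},
    {"num_regions": 3, "correct": True},
    {"num_regions": 3, "correct": False},
    {"num_regions": 4, "correct": False},
]
table = stratified_accuracy(records, "num_regions")
```

Reporting the count next to each accuracy makes it visible when a bucket with long-range dependencies is too small to support a robustness claim. Running the same function for a baseline and for CAVE would give the per-bucket comparison the referee asks for.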
Circularity Check
No circularity: CAVE's process signals and GRPO optimization form an independent credit-assignment method
full rationale
The paper defines CAVE as a structured process-reward approach that computes three explicit signals (belief update, evidence acquisition, adaptive focus control) to assign credit at the action level and then applies GRPO. These signals are presented as complementary heuristics derived from reasoning process internals rather than fitted to final-answer correctness or defined in terms of the optimization target. TRACER-Bench supplies external intermediate supervision and public benchmarks provide independent evaluation. No equations or self-citations reduce the claimed performance gain to a tautology or to parameters fitted on the same target metric. The derivation chain therefore remains non-circular and externally falsifiable.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: $R_{\mathrm{CAVE}}(\tau) = \lambda_{\mathrm{bu}} C_{\mathrm{bu}}(\tau) + \lambda_{\mathrm{ea}} C_{\mathrm{ea}}(\tau) + \lambda_{\mathrm{af}} C_{\mathrm{af}}(\tau)$ with teacher-forced log-likelihood gains
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.