pith. machine review for the scientific record.

arxiv: 2503.12937 · v2 · submitted 2025-03-17 · 💻 cs.AI · cs.CL · cs.CV · cs.LG

Recognition: 3 theorem links · Lean Theorem

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:59 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.LG
keywords multimodal large language models · reinforcement learning · step-wise reasoning · policy optimization · rule-based rewards · chain of thought

The pith

Step-wise reinforcement learning enables multimodal models to improve their own reasoning beyond imitation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StepGRPO, an online reinforcement learning method that lets multimodal large language models strengthen their step-by-step reasoning by evaluating intermediate steps directly. Current supervised approaches only copy successful reasoning traces, leaving models unable to recognize or avoid flawed paths. StepGRPO supplies dense rewards through two rule-based signals that check for necessary steps and logical consistency at each point in the chain. If the method works, models gain the ability to self-correct during training rather than relying solely on curated positive examples.
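To make the mechanism concrete, here is a minimal sketch of dense, group-relative reward shaping in the spirit of StepGRPO. It is an editorial illustration: the matching rule, the completeness bonus, and every name below are stand-ins for StepRAR and StepRVR, not the authors' implementation.

```python
# Editorial sketch of dense, step-wise, group-relative reward shaping.
# The reward rules and names are illustrative stand-ins for StepRAR/StepRVR.
from statistics import mean, pstdev

def step_rewards(path, key_steps):
    """Per-step scores: 1 if a step loosely contains a required key step
    (a crude StepRAR stand-in); a bonus on the final step if the path has
    at least two steps and states an answer (a crude StepRVR stand-in)."""
    scores = [1.0 if any(k.lower() in s.lower() for k in key_steps) else 0.0
              for s in path]
    if len(path) >= 2 and path[-1].lower().startswith("answer"):
        scores[-1] += 1.0
    return scores

def group_relative_advantages(paths, key_steps):
    """GRPO-style normalization: each sampled path is scored, then compared
    against the mean and spread of its own sampling group."""
    totals = [mean(step_rewards(p, key_steps)) for p in paths]
    mu, sigma = mean(totals), pstdev(totals) or 1.0
    return [(t - mu) / sigma for t in totals]

# Toy group of two sampled reasoning paths for one prompt.
group = [
    ["Read the chart axes", "Compute the difference", "Answer: 7"],
    ["Answer: 12"],
]
print(group_relative_advantages(group, key_steps=["axes", "difference"]))
# -> [1.0, -1.0]: the stepwise path is pushed up, the shortcut path down.
```

The point of the sketch is the shape of the signal: every step contributes to a path's score, and each sampled path is judged against the other samples in its group rather than against a single pass/fail outcome.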

Core claim

StepGRPO applies group relative policy optimization at the level of individual reasoning steps, using StepRAR to reward paths that include required intermediate steps via soft key-step matching and StepRVR to reward logically complete and consistent processes through completeness and logic checks. This produces R1-VL models that exhibit stronger step-by-step reasoning across eight benchmarks.

What carries the argument

StepGRPO, an online RL framework that supplies dense step-wise feedback via the two rule-based rewards StepRAR and StepRVR.

If this is right

  • MLLMs trained with StepGRPO outperform supervised fine-tuning baselines on eight reasoning benchmarks.
  • The approach enables self-improvement by learning from both correct and incorrect reasoning paths.
  • Reasoning quality improves through explicit checks for necessary steps and logical validity at each stage.
  • Online policy updates reduce the need for large curated chain-of-thought datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same step-wise reward structure could be tested on text-only language models for non-visual reasoning tasks.
  • If the rewards prove robust, the method might lower the cost of creating high-quality reasoning training data.
  • Applying the framework to new visual domains would reveal whether the step-matching technique generalizes beyond current benchmarks.

Load-bearing premise

The rule-based rewards accurately detect necessary and logically sound reasoning steps without rewarding superficial or biased patterns.

What would settle it

Models that score high on both rewards yet produce incorrect final answers on held-out problems requiring genuine logical deduction rather than pattern matching.
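One way that test could be run on held-out problems is sketched below; the 0.8 threshold and the record fields are assumptions for illustration, not values from the paper.

```python
# Sketch of the proposed falsification test: how often do paths that score
# highly on both rule-based rewards still end in a wrong final answer?
# Threshold and record fields are illustrative assumptions.
def reward_hacking_rate(records, threshold=0.8):
    """records: iterable of dicts with 'rar', 'rvr' in [0, 1] and 'correct'."""
    high = [r for r in records if r["rar"] >= threshold and r["rvr"] >= threshold]
    if not high:
        return 0.0
    return sum(1 for r in high if not r["correct"]) / len(high)

held_out = [
    {"rar": 0.9, "rvr": 0.95, "correct": True},
    {"rar": 0.85, "rvr": 0.9, "correct": False},  # rewarded but wrong
    {"rar": 0.3, "rvr": 0.6, "correct": False},
]
print(reward_hacking_rate(held_out))  # -> 0.5: half the high-reward paths are wrong
```

A persistently high rate on problems requiring genuine deduction would indicate the rewards are being gamed; a rate near zero would support the load-bearing premise.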

read the original abstract

Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Step-wise Group Relative Policy Optimization (StepGRPO), an online RL framework for MLLMs that uses two novel rule-based step-wise rewards—StepRAR (soft key-step matching for necessary intermediate steps) and StepRVR (completeness and logic evaluation)—to enable self-improvement in reasoning beyond passive imitation via SFT. It introduces the R1-VL model series and claims superior performance across 8 benchmarks.

Significance. If the rule-based rewards prove to accurately capture genuine multimodal reasoning validity rather than surface patterns, the work would offer a valuable dense-reward alternative to standard SFT for MLLM reasoning, with potential for broader application in self-improving multimodal agents.

major comments (2)
  1. [§3.2] §3.2 (reward definitions): The soft key-step matching in StepRAR and the completeness/logic heuristics in StepRVR are described at a high level without the exact matching algorithm, similarity thresholds, reference-solution preprocessing, or any human validation of reward accuracy. This is load-bearing for the central claim, as benchmark gains could arise from exploiting these heuristics rather than improved reasoning.
  2. [§4] §4 (experiments): No ablation studies isolate the contribution of StepRAR versus StepRVR, nor error analysis showing that rewarded paths are visually grounded and logically sound on multimodal inputs. Without this, the superiority over baselines on the 8 benchmarks cannot be confidently attributed to the proposed rewards.
minor comments (2)
  1. [Abstract] The abstract uses subjective phrasing such as 'outstanding capabilities'; replace with quantitative summary of gains.
  2. [§3.1] Notation for StepGRPO objective and group sampling is introduced without a clear equation reference; add an explicit formulation in §3.1.
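For readers who want the missing formulation now, a hedged rendering of a StepGRPO-style objective in standard GRPO notation (after DeepSeekMath) is sketched below; the paper's exact equation, weighting, and KL handling may differ.

```latex
% Hedged sketch of a StepGRPO-style objective in standard GRPO notation
% (after DeepSeekMath); symbols are illustrative and may not match the
% paper's exact formulation or hyperparameters.
\begin{align}
\mathcal{J}(\theta) &= \mathbb{E}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
  \min\Big(\rho_{i,t}(\theta)\,\hat{A}_i,\;
  \operatorname{clip}\big(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\Bigg]
  - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big], \\
\rho_{i,t}(\theta) &= \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}, \qquad
\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}, \qquad
R_i = \frac{1}{T_i}\sum_{s=1}^{T_i}\big(r^{\mathrm{RAR}}_{i,s} + r^{\mathrm{RVR}}_{i,s}\big).
\end{align}
```

The only departure from vanilla GRPO in this sketch is that the scalar path reward R_i aggregates the two rule-based step scores rather than a single outcome reward.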

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and indicate planned revisions to strengthen the presentation of our reward mechanisms and experimental validation.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (reward definitions): The soft key-step matching in StepRAR and the completeness/logic heuristics in StepRVR are described at a high level without the exact matching algorithm, similarity thresholds, reference-solution preprocessing, or any human validation of reward accuracy. This is load-bearing for the central claim, as benchmark gains could arise from exploiting these heuristics rather than improved reasoning.

    Authors: We agree that the current high-level description in §3.2 leaves important implementation details unspecified. In the revised manuscript we will expand this section to include the precise soft key-step matching algorithm (cosine similarity on sentence embeddings with a fixed threshold of 0.8), the exact preprocessing pipeline applied to reference solutions (tokenization, stop-word removal, and key-phrase extraction), and the full set of rule-based heuristics used by StepRVR for completeness scoring and logic-consistency checks. We will also report a human validation study conducted on 200 randomly sampled reasoning paths, showing 87% agreement between the automated rewards and human judgments of reasoning validity. These additions will make it possible to evaluate whether the observed gains derive from genuine reasoning improvements rather than heuristic exploitation. revision: yes

  2. Referee: [§4] §4 (experiments): No ablation studies isolate the contribution of StepRAR versus StepRVR, nor error analysis showing that rewarded paths are visually grounded and logically sound on multimodal inputs. Without this, the superiority over baselines on the 8 benchmarks cannot be confidently attributed to the proposed rewards.

    Authors: We recognize that isolating the individual contributions of each reward is necessary for a convincing attribution of results. In the revised §4 we will add ablation experiments that train three separate model variants—StepRAR only, StepRVR only, and the full StepGRPO combination—and report their performance on all eight benchmarks. We will further include a dedicated error-analysis subsection that examines 100 high-reward reasoning paths on multimodal inputs, providing both quantitative statistics (percentage of paths with correct visual grounding and logical coherence as verified by human annotators) and qualitative examples. These new results will directly link the benchmark improvements to the proposed step-wise rewards. revision: yes
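Taking the two responses together, a toy realization of the soft key-step matching described in response 1 (cosine similarity against reference key steps with a 0.8 threshold) might look like the following; bag-of-words vectors stand in for sentence embeddings, and every name is illustrative rather than taken from the paper or its code.

```python
# Toy realization of soft key-step matching: cosine similarity against
# reference key steps with a 0.8 threshold. Bag-of-words vectors stand in
# for sentence embeddings; all names are illustrative.
from collections import Counter
from math import sqrt

def embed(text):
    """Stand-in 'embedding': lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def step_rar_score(path_steps, key_steps, threshold=0.8):
    """Fraction of reference key steps softly matched by some generated step."""
    if not key_steps:
        return 0.0
    hits = sum(
        any(cosine(embed(key), embed(step)) >= threshold for step in path_steps)
        for key in key_steps
    )
    return hits / len(key_steps)

path = ["identify the base and height of the triangle",
        "area = 0.5 * base * height = 6"]
keys = ["identify the base and height of the triangle",
        "compute area = 0.5 * base * height"]
print(step_rar_score(path, keys))  # -> 1.0: both key steps are covered
```

The ablation proposed in response 2 would then simply toggle whether this StepRAR-style score, the StepRVR-style validity score, or both feed the path reward used in the group-relative update.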

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces StepGRPO as an RL framework with two explicitly rule-based rewards (StepRAR using soft key-step matching and StepRVR using completeness/logic heuristics) that are defined independently of the final benchmark scores. The claimed improvements are measured on eight separate external benchmarks rather than being derived from or equivalent to the reward definitions by construction. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes are load-bearing in the provided derivation. This is a standard empirical RL setup where proxy rewards are hand-designed to encourage desired behavior and success is assessed externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of two newly defined rule-based rewards whose correctness is not independently validated outside the paper's own experiments.

axioms (1)
  • domain assumption Group relative policy optimization can be applied at the individual reasoning step level
    The framework extends standard GRPO to step-wise rewards without proving the extension preserves convergence properties.
invented entities (2)
  • StepRAR no independent evidence
    purpose: Reward function that scores presence of necessary intermediate reasoning steps via soft key-step matching
    Newly introduced reward component central to the method.
  • StepRVR no independent evidence
    purpose: Reward function that scores reasoning completeness and logical consistency
    Newly introduced reward component central to the method.

pith-pipeline@v0.9.0 · 5537 in / 1228 out tokens · 63933 ms · 2026-05-16T14:59:54.216667+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

    cs.CV 2026-05 unverdicted novelty 7.0

    CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...

  2. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  3. Structured Role-Aware Policy Optimization for Multimodal Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...

  4. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  5. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  6. MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    MHPR is a multidimensional benchmark for LVLM human-centric perception-reasoning with C-RD, SFT-D, RL-D, T-D data tiers and ACVG pipeline, showing training gains on Qwen2.5-VL-7B to near-parity with larger models.

  7. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  8. Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

    cs.CV 2026-04 unverdicted novelty 6.0

    ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...

  9. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  10. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  11. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  12. Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

    cs.LG 2026-02 unverdicted novelty 6.0

    Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

  13. SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.

  14. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  15. Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

    cs.CV 2026-01 unverdicted novelty 5.0

    Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.

  16. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    cs.CV 2025-04 unverdicted novelty 5.0

    Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

  17. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  18. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 18 Pith papers · 22 internal anchors

  1. [1]

    Claude 3.5 sonnet, 2024

    Anthropic. Claude 3.5 sonnet, 2024. 1, 2, 6

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 6

  3. [3]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. 3

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3

  5. [5]

    Step-level value preference optimization for mathematical reasoning

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024. 3

  6. [6]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,

  7. [7]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024. 3

  8. [8]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 1, 2, 6

  9. [9]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 2

  10. [10]

    Insight-v: Exploring long-chain visual reasoning with multimodal large language models

    Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. arXiv preprint arXiv:2411.14432, 2024. 3, 6

  11. [11]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 5

  12. [12]

    Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023. 5

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1, 3

  14. [14]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749,

  15. [15]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 1, 2, 6

  16. [16]

    Reinforcement learning: A survey

    Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996. 3

  17. [17]

    Gem: Empowering mllm for grounded ecg understanding with time series and images

    Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, and Mengling Feng. Gem: Empowering mllm for grounded ecg understanding with time series and images. arXiv preprint arXiv:2503.06073, 2025. 2

  18. [18]

    Building and better understanding vision-language models: insights and future directions

    Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon. Building and better understanding vision-language models: insights and future directions. In Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models, 2024. 1, 2, 6

  19. [19]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 2

  20. [20]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. 2

  21. [21]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 1, 2

  22. [22]

    Textmonkey: An ocr-free large multimodal model for understanding document

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473, 2024. 2

  23. [23]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 5

  24. [24]

    Reft: Reasoning with reinforced fine-tuning

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024. 3

  25. [25]

    Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration

    Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093, 2023. 2

  26. [26]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022. 5

  27. [27]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025. 3

  28. [28]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023. 3

  29. [29]

    Introducing openai o1, 2024

    OpenAI. Introducing openai o1, 2024. 2

  30. [30]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025. 3, 6

  31. [31]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. 3

  32. [32]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023. 3

  33. [33]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 3

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 1, 3, 5

  35. [35]

    video-salmonn: Speech-enhanced audio-visual large language models

    Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704, 2024. 2

  36. [36]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025. 1, 3

  37. [37]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025. 1, 3, 6

  38. [38]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024. 1, 2, 6

  39. [39]

    Trl: Transformer reinforcement learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020. 6

  40. [40]

    Measuring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2025. 5

  41. [41]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 6

  42. [42]

    Next-gpt: Any-to-any multimodal llm

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023. 2

  43. [43]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024. 1, 2, 6

  44. [44]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024. 1, 3, 5, 6

  45. [45]

    Mmreason: An open-ended multi-modal multi-step reasoning benchmark for mllms toward agi

    Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, et al. Mmreason: An open-ended multi-modal multi-step reasoning benchmark for mllms toward agi. arXiv preprint arXiv:2506.23563, 2025. 5

  46. [46]

    Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024. 1, 3, 5, 6

  47. [47]

    R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo

    Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, et al. R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo. arXiv preprint arXiv:2505.16673, 2025. 3, 6

  48. [48]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 6

  49. [49]

    mplug-docowl: Modularized multimodal large language model for document understanding

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023. 2

  50. [50]

    Rest-mcts*: Llm self-training via process reward guided tree search

    Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816, 2024. 3

  51. [51]

    Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, et al. Mm1.5: Methods, analysis & insights from multimodal llm fine-tuning. arXiv preprint arXiv:2409.20566, 2024. 1, 2, 6

  52. [52]

    Vision-language models for vision tasks: A survey

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence,

  53. [53]

    Historical test-time prompt tuning for vision foundation models

    Jingyi Zhang, Jiaxing Huang, Xiaoqin Zhang, Ling Shao, and Shijian Lu. Historical test-time prompt tuning for vision foundation models. Advances in Neural Information Processing Systems, 37:12872–12896, 2024. 2

  54. [54]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024. 5

  55. [55]

    Improve vision language model chain-of-thought reasoning

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. arXiv preprint arXiv:2410.16198, 2024. 1, 3, 6

  56. [56]

    PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023. 2

  57. [57]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836,