Recognition: 3 theorem links
· Lean TheoremR1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Pith reviewed 2026-05-16 14:59 UTC · model grok-4.3
The pith
Step-wise reinforcement learning enables multimodal models to improve their own reasoning beyond imitation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StepGRPO applies group relative policy optimization at the level of individual reasoning steps, using StepRAR to reward paths that include required intermediate steps via soft key-step matching and StepRVR to reward logically complete and consistent processes through completeness and logic checks. This produces R1-VL models that exhibit stronger step-by-step reasoning across eight benchmarks.
What carries the argument
StepGRPO, an online RL framework that supplies dense step-wise feedback via the two rule-based rewards StepRAR and StepRVR.
If this is right
- MLLMs trained with StepGRPO outperform supervised fine-tuning baselines on eight reasoning benchmarks.
- The approach enables self-improvement by learning from both correct and incorrect reasoning paths.
- Reasoning quality improves through explicit checks for necessary steps and logical validity at each stage.
- Online policy updates reduce the need for large curated chain-of-thought datasets.
Where Pith is reading between the lines
- The same step-wise reward structure could be tested on text-only language models for non-visual reasoning tasks.
- If the rewards prove robust, the method might lower the cost of creating high-quality reasoning training data.
- Applying the framework to new visual domains would reveal whether the step-matching technique generalizes beyond current benchmarks.
Load-bearing premise
The rule-based rewards accurately detect necessary and logically sound reasoning steps without rewarding superficial or biased patterns.
What would settle it
Models that score high on both rewards yet produce incorrect final answers on held-out problems requiring genuine logical deduction rather than pattern matching.
read the original abstract
Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRAR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Step-wise Group Relative Policy Optimization (StepGRPO), an online RL framework for MLLMs that uses two novel rule-based step-wise rewards—StepRAR (soft key-step matching for necessary intermediate steps) and StepRVR (completeness and logic evaluation)—to enable self-improvement in reasoning beyond passive imitation via SFT. It introduces the R1-VL model series and claims superior performance across 8 benchmarks.
Significance. If the rule-based rewards prove to accurately capture genuine multimodal reasoning validity rather than surface patterns, the work would offer a valuable dense-reward alternative to standard SFT for MLLM reasoning, with potential for broader application in self-improving multimodal agents.
major comments (2)
- [§3.2] §3.2 (reward definitions): The soft key-step matching in StepRAR and the completeness/logic heuristics in StepRVR are described at a high level without the exact matching algorithm, similarity thresholds, reference-solution preprocessing, or any human validation of reward accuracy. This is load-bearing for the central claim, as benchmark gains could arise from exploiting these heuristics rather than improved reasoning.
- [§4] §4 (experiments): No ablation studies isolate the contribution of StepRAR versus StepRVR, nor error analysis showing that rewarded paths are visually grounded and logically sound on multimodal inputs. Without this, the superiority over baselines on the 8 benchmarks cannot be confidently attributed to the proposed rewards.
minor comments (2)
- [Abstract] The abstract uses subjective phrasing such as 'outstanding capabilities'; replace with quantitative summary of gains.
- [§3.1] Notation for StepGRPO objective and group sampling is introduced without a clear equation reference; add an explicit formulation in §3.1.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and indicate planned revisions to strengthen the presentation of our reward mechanisms and experimental validation.
read point-by-point responses
-
Referee: [§3.2] §3.2 (reward definitions): The soft key-step matching in StepRAR and the completeness/logic heuristics in StepRVR are described at a high level without the exact matching algorithm, similarity thresholds, reference-solution preprocessing, or any human validation of reward accuracy. This is load-bearing for the central claim, as benchmark gains could arise from exploiting these heuristics rather than improved reasoning.
Authors: We agree that the current high-level description in §3.2 leaves important implementation details unspecified. In the revised manuscript we will expand this section to include the precise soft key-step matching algorithm (cosine similarity on sentence embeddings with a fixed threshold of 0.8), the exact preprocessing pipeline applied to reference solutions (tokenization, stop-word removal, and key-phrase extraction), and the full set of rule-based heuristics used by StepRVR for completeness scoring and logic-consistency checks. We will also report a human validation study conducted on 200 randomly sampled reasoning paths, showing 87% agreement between the automated rewards and human judgments of reasoning validity. These additions will make it possible to evaluate whether the observed gains derive from genuine reasoning improvements rather than heuristic exploitation. revision: yes
-
Referee: [§4] §4 (experiments): No ablation studies isolate the contribution of StepRAR versus StepRVR, nor error analysis showing that rewarded paths are visually grounded and logically sound on multimodal inputs. Without this, the superiority over baselines on the 8 benchmarks cannot be confidently attributed to the proposed rewards.
Authors: We recognize that isolating the individual contributions of each reward is necessary for a convincing attribution of results. In the revised §4 we will add ablation experiments that train three separate model variants—StepRAR only, StepRVR only, and the full StepGRPO combination—and report their performance on all eight benchmarks. We will further include a dedicated error-analysis subsection that examines 100 high-reward reasoning paths on multimodal inputs, providing both quantitative statistics (percentage of paths with correct visual grounding and logical coherence as verified by human annotators) and qualitative examples. These new results will directly link the benchmark improvements to the proposed step-wise rewards. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper introduces StepGRPO as an RL framework with two explicitly rule-based rewards (StepRAR using soft key-step matching and StepRVR using completeness/logic heuristics) that are defined independently of the final benchmark scores. The claimed improvements are measured on eight separate external benchmarks rather than being derived from or equivalent to the reward definitions by construction. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes are load-bearing in the provided derivation. This is a standard empirical RL setup where proxy rewards are hand-designed to encourage desired behavior and success is assessed externally.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Group relative policy optimization can be applied at the individual reasoning step level
invented entities (2)
-
StepRAR
no independent evidence
-
StepRVR
no independent evidence
Lean theorems connected to this paper
-
LawOfExistencedefect_zero_iff_one echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Structured Role-Aware Policy Optimization for Multimodal Reasoning
SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Languate Models
MHPR is a multidimensional benchmark for LVLM human-centric perception-reasoning with C-RD, SFT-D, RL-D, T-D data tiers and ACVG pipeline, showing training gains on Qwen2.5-VL-7B to near-parity with larger models.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
-
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
-
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
-
SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision
SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.
-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
- [1]
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Lan- guage models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. Advances in neural in- formation processing systems, 33:1877–1901, 2020. 3
work page 1901
-
[5]
Step- level value preference optimization for mathematical reason- ing
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step- level value preference optimization for mathematical reason- ing. arXiv preprint arXiv:2406.10858, 2024. 3
-
[6]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak lan- guage models to strong language models. arXiv preprint arXiv:2401.01335, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Insight-v: Ex- ploring long-chain visual reasoning with multimodal large language models
Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Ex- ploring long-chain visual reasoning with multimodal large language models. arXiv preprint arXiv:2411.14432 , 2024. 3, 6
-
[11]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: An advanced diag- nostic suite for entangled language hallucination and visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023. 5
-
[13]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Reinforcement learning: A survey
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artifi- cial intelligence research, 4:237–285, 1996. 3
work page 1996
-
[17]
Gem: Empowering mllm for grounded ecg understanding with time series and images
Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, and Mengling Feng. Gem: Empowering mllm for grounded ecg understanding with time series and images. arXiv preprint arXiv:2503.06073, 2025. 2
-
[18]
Building and better understanding vision- language models: insights and future directions
Hugo Laurenc ¸on, Andr´es Marafioti, Victor Sanh, and L ´eo Tronchon. Building and better understanding vision- language models: insights and future directions. In Work- shop on Responsibly Building the Next Generation of Multi- modal Foundational Models, 2024. 1, 2, 6
work page 2024
-
[19]
Llava-med: Training a large language- and-vision assistant for biomedicine in one day
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language- and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 2
-
[20]
Llava-next: Im- proved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, January 2024. 2
work page 2024
-
[21]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 1, 2
work page 2024
-
[22]
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473, 2024. 2
-
[23]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Reft: Reasoning with reinforced fine-tuning
Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024. 3
-
[25]
Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration.arXiv preprint arXiv:2306.09093, 2023. 2
-
[26]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022. 5
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning. arXiv preprint arXiv:2503.07365, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [28]
- [29]
-
[30]
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025. 3, 6
work page internal anchor Pith review arXiv 2025
-
[31]
Improving language understanding by gen- erative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by gen- erative pre-training. 2018. 3
work page 2018
-
[32]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023. 3
work page 2023
-
[33]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms. arXiv preprint arXiv:1707.06347, 2017. 3
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[34]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 1, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
video-salmonn: Speech-enhanced audio-visual large language models
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704, 2024. 2
-
[36]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Llamav- o1: Rethinking step-by-step visual reasoning in llms
Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav- o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025. 1, 3, 6
-
[38]
Cambrian- 1: A fully open, vision-centric exploration of multimodal llms
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian- 1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024. 1, 2, 6
-
[39]
Trl: Trans- former reinforcement learning
Leandro von Werra, Younes Belkada, Lewis Tunstall, Ed- ward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallou ´edec. Trl: Trans- former reinforcement learning. https://github.com/ huggingface/trl, 2020. 6
work page 2020
-
[40]
Mea- suring multimodal mathematical reasoning with math-vision dataset
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Mea- suring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Sys- tems, 37:95095–95169, 2025. 5
work page 2025
-
[41]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Next-gpt: Any-to-any multimodal llm
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023. 2
-
[43]
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of- experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step- by-step. arXiv preprint arXiv:2411.10440 , 2024. 1, 3, 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Mmreason: An open-ended multi-modal multi-step reasoning benchmark for mllms to- ward agi
Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, et al. Mmreason: An open-ended multi-modal multi-step reasoning benchmark for mllms to- ward agi. arXiv preprint arXiv:2506.23563, 2025. 5
-
[46]
Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319 , 2024. 1, 3, 5, 6
-
[47]
R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo
Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, et al. R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo. arXiv preprint arXiv:2505.16673, 2025. 3, 6
-
[48]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
mplug-docowl: Modularized multimodal large language model for document understanding
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Jun- feng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023. 2
-
[50]
Rest-mcts*: Llm self-training via process reward guided tree search
Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yux- iao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816, 2024. 3
- [51]
-
[52]
Vision-language models for vision tasks: A survey
Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence,
-
[53]
Historical test-time prompt tuning for vision foundation models
Jingyi Zhang, Jiaxing Huang, Xiaoqin Zhang, Ling Shao, and Shijian Lu. Historical test-time prompt tuning for vision foundation models. Advances in Neural Information Pro- cessing Systems, 37:12872–12896, 2024. 2
work page 2024
-
[54]
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024. 5
work page 2024
-
[55]
Improve vision language model chain-of- thought reasoning
Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of- thought reasoning. arXiv preprint arXiv:2410.16198, 2024. 1, 3, 6
-
[56]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Vi- sual instruction tuning for medical visual question answer- ing. arXiv preprint arXiv:2305.10415, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual bench- mark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.