VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct
Pith reviewed 2026-06-26 08:32 UTC · model grok-4.3
The pith
Decoupling prompt evolution from answer verification enables reliable scaling of training data for visual mathematical reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriEvol is an iterative framework that first applies a type-aware evolution module to rewrite low-difficulty image-question seeds into harder, image-grounded prompts and then passes candidate answers through an HTV-Agent verifier that accepts them only after multi-source counter-evidence has failed to refute them. Scaling the verified evolved SFT data from 10K to 250K samples raises mean accuracy from 35.42 to 54.73; with backbone, SFT initialization, and GRPO recipe fixed, the pipeline contributes a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 is attributable to the evolved prompts and +2.06 to the verifier.
What carries the argument
The HTV-Agent verifier that accepts an answer only after multi-source counter-evidence has failed to refute it, together with the type-aware evolution module that rewrites low-difficulty seeds into harder image-grounded prompts.
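The acceptance rule described here (accept only when no source of counter-evidence refutes the answer) can be illustrated with a minimal sketch. This is not the HTV-Agent implementation: the `Channel` interface, the two toy channels, and the choice to run every channel so that a full trace is kept are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: a channel returns a refutation message,
# or None when it finds nothing that contradicts the answer.
Channel = Callable[[str, str], Optional[str]]


@dataclass
class Verdict:
    accepted: bool
    trace: List[str]  # one entry per channel, kept for later auditing


def verify_by_falsification(question: str, answer: str,
                            channels: Sequence[Channel]) -> Verdict:
    trace: List[str] = []
    refuted = False
    for channel in channels:
        refutation = channel(question, answer)
        if refutation is None:
            trace.append(f"{channel.__name__}: no counter-evidence found")
        else:
            trace.append(f"{channel.__name__}: refuted ({refutation})")
            refuted = True
    # Accept only if at least one channel ran and none of them refuted.
    return Verdict(accepted=bool(trace) and not refuted, trace=trace)


# Toy channels (hypothetical): questions look like "3 + 4".
def integer_format_channel(question: str, answer: str) -> Optional[str]:
    return None if answer.lstrip("-").isdigit() else "answer is not an integer"


def recompute_sum_channel(question: str, answer: str) -> Optional[str]:
    left, right = question.split("+")
    expected = str(int(left) + int(right))
    return None if answer == expected else f"recomputed sum is {expected}"


CHANNELS = [integer_format_channel, recompute_sum_channel]
good = verify_by_falsification("3 + 4", "7", CHANNELS)
bad = verify_by_falsification("3 + 4", "8", CHANNELS)
print(good.accepted, bad.accepted)
print(bad.trace)
```

Running all channels instead of stopping at the first refutation costs more compute but leaves a complete per-sample trace, which matches the paper's stated goal of releasing traces for auditing.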
If this is right
- Scaling evolved SFT data from 10K to 250K samples raises mean accuracy from 35.42 to 54.73 on the five-benchmark visual-math suite.
- With backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline.
- Of the +3.88 gain, +1.82 is attributable to the evolved prompts and +2.06 to the HTV-Agent verifier.
- The verified dataset can be extended by adding new evolution routes or additional verifier channels.
- The full verifier trace released for every sample allows downstream auditing and further scaling.
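The headline numbers can be checked for internal consistency with a few lines of arithmetic. The sketch below uses only the figures quoted from the abstract; the variable names are illustrative, and the "share" values are a derived convenience, not something the paper reports.

```python
# Figures quoted from the abstract (mean accuracy on the five-benchmark suite).
sft_10k, sft_250k = 35.42, 54.73
prompt_gain, verifier_gain, total_gain = 1.82, 2.06, 3.88

# Gain from scaling evolved SFT data from 10K to 250K samples.
sft_scaling_gain = round(sft_250k - sft_10k, 2)

# The two RL-stage components should add up to the cumulative gain.
decomposition_sum = round(prompt_gain + verifier_gain, 2)

# Derived shares of the cumulative RL-stage gain (not reported in the paper).
prompt_share = round(prompt_gain / total_gain, 3)
verifier_share = round(verifier_gain / total_gain, 3)

print("SFT scaling gain:", sft_scaling_gain)
print("Decomposition sums to total:", abs(decomposition_sum - total_gain) < 1e-9)
print("Prompt share:", prompt_share, "Verifier share:", verifier_share)
```

The SFT scaling gain is much larger than the RL-stage gain, so the +3.88 decomposition is a claim about the last few points of accuracy, not about the bulk of the improvement.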
Where Pith is reading between the lines
- The same separation of evolution and verification could be applied to scale reliable data in other multimodal reasoning domains such as science or coding.
- Releasing complete verifier traces may enable independent development of stronger or cheaper verifiers by the community.
- If the verifier remains reliable at even larger scales, the approach could support training runs with millions of verified visual-math examples.
Load-bearing premise
The HTV-Agent verifier can keep answer labels reliable at large scale without introducing systematic false accepts or false rejects that would corrupt the training signal.
What would settle it
A measurement showing that the verifier's false-accept or false-reject rate rises sharply once the dataset exceeds 100K samples, or an ablation in which replacing the verifier with a weaker labeler eliminates the reported +2.06 gain.
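The measurement proposed here could be computed from a held-out set in which each sample has both a verifier decision and an independently checked correctness label. The sketch below is one possible way to do that. The record format and the rate definitions are assumptions: false-accept rate is the fraction of incorrect answers the verifier accepted, and false-reject rate is the fraction of correct answers it rejected. The sample records are synthetic.

```python
from typing import Iterable, Optional, Tuple


def verifier_error_rates(
    records: Iterable[Tuple[bool, bool]],
) -> Tuple[Optional[float], Optional[float]]:
    """Each record is (verifier_accepted, answer_is_correct).

    Returns (false_accept_rate, false_reject_rate); a rate is None when
    its denominator is zero (no incorrect or no correct answers).
    """
    rows = list(records)
    n_incorrect = sum(1 for _, correct in rows if not correct)
    n_correct = sum(1 for _, correct in rows if correct)
    false_accepts = sum(1 for accepted, correct in rows if accepted and not correct)
    false_rejects = sum(1 for accepted, correct in rows if not accepted and correct)
    far = false_accepts / n_incorrect if n_incorrect else None
    frr = false_rejects / n_correct if n_correct else None
    return far, frr


# Synthetic held-out records, for illustration only.
held_out = [
    (True, True),    # correct answer, accepted
    (True, False),   # incorrect answer, accepted -> false accept
    (False, True),   # correct answer, rejected  -> false reject
    (False, False),  # incorrect answer, rejected
    (True, True),    # correct answer, accepted
]
far, frr = verifier_error_rates(held_out)
print("false-accept rate:", far)
print("false-reject rate:", frr)
```

Computing the same two rates on buckets of increasing dataset size (for example below and above 100K samples) would give the scaling curve the test calls for.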
Original abstract
Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VeriEvol, an iterative framework that decouples prompt difficulty scaling via type-aware evolution operators from answer reliability via the HTV-Agent verifier (offline multi-source hypothesis-test falsification). It reports that scaling verified SFT data from 10K to 250K samples lifts mean accuracy on a five-benchmark visual-math suite from 35.42 to 54.73; with backbone, SFT init, and GRPO recipe fixed, VeriEvol yields a cumulative +3.88 over an un-evolved RL baseline, decomposed as +1.82 from evolved prompts and +2.06 from the verifier. The work releases prompts, data, models, code, and full verifier traces.
Significance. If the verifier's error rate remains controlled at 250K scale, the decoupling of evolution routes from verifiable labeling supplies a practical route to higher-quality RL data for multimodal math without assuming trusted labellers. The explicit release of verifier traces for every sample is a concrete strength that enables downstream auditing and extension.
major comments (2)
- [abstract and §3.2 (HTV-Agent description)] The attribution of +2.06 to HTV-Agent (abstract) is load-bearing for the central claim yet rests on the untested assumption that offline hypothesis-test falsification maintains stable precision/recall as prompt diversity grows; no held-out quantitative evaluation of false-accept or false-reject rates, nor any scaling analysis of verifier error with data volume, is supplied.
- [abstract and experimental results section] The reported decomposition (+1.82 prompts, +2.06 verifier) requires an ablation that isolates each component while holding the other fixed; the abstract states the numbers but supplies neither the corresponding table rows nor statistical significance tests for the deltas.
minor comments (2)
- [abstract] The five-benchmark suite and exact metric (mean accuracy) should be named explicitly in the abstract rather than referenced only as 'five-benchmark visual-math suite'.
- [§3] Notation for evolution routes and verifier channels is introduced without a compact summary table; a single table listing each route, its operator, and the verifier channels would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on VeriEvol. The two major comments highlight important aspects of evidence strength for the verifier contribution and the reported decomposition. We address each point below and commit to revisions that strengthen the manuscript without altering the core claims.
Point-by-point responses
- Referee: [abstract and §3.2 (HTV-Agent description)] The attribution of +2.06 to HTV-Agent (abstract) is load-bearing for the central claim yet rests on the untested assumption that offline hypothesis-test falsification maintains stable precision/recall as prompt diversity grows; no held-out quantitative evaluation of false-accept or false-reject rates, nor any scaling analysis of verifier error with data volume, is supplied.
  Authors: We agree that direct held-out metrics on verifier error rates would provide stronger grounding for attributing the +2.06 gain specifically to HTV-Agent rather than to downstream effects. The reported gain is measured via the controlled RL performance delta (evolved+verified vs. evolved+unverified data) with all other factors fixed, and the release of full verifier traces enables external auditing. However, we acknowledge the absence of explicit precision/recall scaling curves. In revision we will add a held-out evaluation set, report false-accept and false-reject rates, and include a scaling plot of verifier error versus data volume.
  Revision: yes
- Referee: [abstract and experimental results section] The reported decomposition (+1.82 prompts, +2.06 verifier) requires an ablation that isolates each component while holding the other fixed; the abstract states the numbers but supplies neither the corresponding table rows nor statistical significance tests for the deltas.
  Authors: The decomposition is obtained from two controlled ablations described in the experimental results section: one holding the verifier fixed while varying prompt evolution, and one holding evolved prompts fixed while varying verification. We agree that presenting these as explicit table rows with significance tests would improve clarity and allow readers to assess the deltas directly. In the revision we will add a dedicated ablation table containing the isolated contributions together with bootstrap confidence intervals or paired significance tests for each reported delta.
  Revision: yes
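The bootstrap confidence intervals mentioned in the rebuttal could take the form of a paired percentile bootstrap over benchmark items. The sketch below is one common way to do this and is not taken from the paper. The function name, the per-item 0/1 correctness lists, and the resample count are illustrative assumptions, and the data is synthetic.

```python
import random
from typing import List, Tuple


def paired_bootstrap_delta(
    correct_a: List[int],
    correct_b: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy delta (B minus A, in points) with a percentile bootstrap CI.

    correct_a and correct_b hold 0/1 correctness for two systems evaluated
    on the same items, in the same order.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(correct_a)
    rng = random.Random(seed)
    # Resampling per-item differences keeps the pairing between systems.
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    observed = 100.0 * sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(100.0 * sum(resample) / n)
    samples.sort()
    lower = samples[int((alpha / 2) * n_resamples)]
    upper = samples[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


# Synthetic example: system B fixes 10 items that system A gets wrong.
baseline = [0] * 50 + [1] * 50
system = [1] * 10 + [0] * 40 + [1] * 50
observed, lower, upper = paired_bootstrap_delta(baseline, system)
print(f"delta = {observed:.2f} points, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

A delta whose interval excludes zero would support attributing the gain to the component being ablated; for deltas as small as +1.82 or +2.06, the interval width depends heavily on benchmark size.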
Circularity Check
No circularity; empirical scaling results are externally benchmarked
The paper reports measured accuracy gains on five external visual-math benchmarks when scaling evolved SFT data from 10K to 250K and when adding the HTV-Agent verifier under a fixed GRPO recipe. No equations, uniqueness theorems, or self-citations are invoked to derive the gains; the +1.82 / +2.06 decomposition is obtained by ablation with backbone and recipe held fixed. Verifier reliability is an empirical assumption whose validity is left to the released traces rather than enforced by definition. This is a standard empirical pipeline paper with no load-bearing self-referential step.
Decider prompt (excerpt)
- Briefly summarize the verification's conclusion (1-2 sentences).
- Assess the QUALITY of the verification: is it logically sound, or does it contain self-contradictions, arithmetic errors, or unsupported claims?
- Decide the final answer:
  - If verification is high-quality AND explicitly rejects the initial answer, trust the verification.
  - If verification is low-quality or self-contradictory, trust the initial answer.
  - If they agree, keep the answer.
- If the Solver and Verifier disagree AND you are not confident in either, output <require_rethink>true</require_rethink>.
- Output your exact final answer inside the tag <final_answer>X</final_answer>. Examples: <final_answer>A</final_answer> or <final_answer>42</final_answer>
- Output your confidence as <confidence>0-100</confidence>.
User prompt template.
Question: {question}
Context: {context} (omitted if empty)
Options: {choices} (omitted if not multiple choice)
Solver's answer: {hypothesis}
Verification report: {verification_text}
Post-processing. The decider's response is parsed by regular expressions for <final_answer>, <confidenc...
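The post-processing step is described as regular-expression parsing of the decider's tagged output. A minimal sketch of such a parser follows. The source text is truncated, so the exact set of parsed tags and the fallback behaviour are assumptions: this version handles the three tags that appear in the prompt (`final_answer`, `confidence`, `require_rethink`), and the names `DeciderOutput` and `parse_decider_response` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DeciderOutput(NamedTuple):
    final_answer: Optional[str]
    confidence: Optional[int]  # 0-100, or None if missing or malformed
    require_rethink: bool


def _tag(name: str, text: str) -> Optional[str]:
    """Return the whitespace-trimmed content of the first <name>...</name> pair."""
    pattern = rf"<{name}>\s*(.*?)\s*</{name}>"
    match = re.search(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else None


def parse_decider_response(text: str) -> DeciderOutput:
    answer = _tag("final_answer", text)
    raw_confidence = _tag("confidence", text)
    confidence = None
    if raw_confidence is not None and raw_confidence.isdigit():
        # Clamp to the 0-100 range requested by the prompt.
        confidence = max(0, min(100, int(raw_confidence)))
    rethink = (_tag("require_rethink", text) or "").lower() == "true"
    return DeciderOutput(answer, confidence, rethink)


# Hypothetical decider response, for illustration only.
response = (
    "The verification recomputes the area and disagrees with the solver.\n"
    "<require_rethink> true</require_rethink>\n"
    "<final_answer>42</final_answer>\n"
    "<confidence>85</confidence>"
)
parsed = parse_decider_response(response)
print(parsed)
```

Treating a missing or malformed tag as `None` (or `False` for the rethink flag) keeps the parser total, so a badly formatted response can be routed to a retry instead of raising.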