arxiv: 2508.19652 · v2 · submitted 2025-08-27 · 💻 cs.CV

Self-Rewarding Vision-Language Model via Reasoning Decomposition

Pith reviewed 2026-05-18 21:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · self-rewarding reinforcement learning · visual reasoning · reasoning decomposition · visual hallucinations · language shortcuts

The pith

Vision-language models can improve their own visual reasoning by generating self-contained descriptions and rewarding themselves without external supervisors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Vision-SR1, a method that splits VLM reasoning into a visual part and a language part. The model first writes a description of what it sees, one that must stand alone as a basis for answering the question. The same model is then asked to answer from that description alone, and the outcome serves as a score for the visual part. This visual score and the language-reasoning score are combined in a decoupled way during reinforcement learning, so the model gets an explicit signal about whether it actually looked at the image. If this works, VLMs would need fewer external reward models and would less often ignore the picture and guess from text patterns.

Core claim

Vision-SR1 decomposes VLM reasoning into visual reasoning and language reasoning components. The model is prompted to first produce self-contained visual descriptions that suffice to answer the question without the original image. The same model is then reprompted using only those descriptions to compute a visual reward, which is combined with a language reasoning reward through a decoupled reward-advantage framework. This self-rewarding loop provides denser visual supervision during post-training and leads to better performance on vision-language tasks.
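The two-pass loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Generate` callable, the `<description>` and `<answer>` tags, the prompt wording, and the binary exact-match rewards are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model interface: (prompt text, optional image bytes) -> generated text.
Generate = Callable[[str, Optional[bytes]], str]


@dataclass
class Rollout:
    description: str
    answer: str
    answer_reward: float  # pass 1: answer with the image vs. ground truth
    visual_reward: float  # pass 2: answer from the description only vs. ground truth


def extract(text: str, tag: str) -> str:
    """Return the content of <tag>...</tag>, or '' if the tag pair is missing."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    start = text.find(open_tag)
    if start == -1:
        return ""
    end = text.find(close_tag, start)
    if end == -1:
        return ""
    return text[start + len(open_tag):end].strip()


def self_reward_rollout(generate: Generate, image: bytes, question: str, gold: str) -> Rollout:
    # Pass 1: the model sees the image and the question.
    first = generate(f"Question: {question}\nDescribe the image, then answer.", image)
    description = extract(first, "description")
    answer = extract(first, "answer")
    answer_reward = float(answer.lower() == gold.lower())

    # Pass 2: the same model sees only the question and its own description.
    second = generate(
        f"Description: {description}\nQuestion: {question}\n"
        "Answer using the description only.",
        None,
    )
    blind_answer = extract(second, "answer")
    # Assumed binary visual reward: the description was sufficient to recover the answer.
    visual_reward = float(bool(description) and blind_answer.lower() == gold.lower())
    return Rollout(description, answer, answer_reward, visual_reward)


def toy_generate(prompt: str, image: Optional[bytes]) -> str:
    """Stand-in for a VLM so the sketch runs without any model."""
    if image is not None:
        return "<description>Two red apples on a table.</description><answer>2</answer>"
    if "Two red apples" in prompt:
        return "<answer>2</answer>"
    return "<answer>unknown</answer>"


rollout = self_reward_rollout(
    toy_generate, b"fake-image-bytes", "How many apples are on the table?", "2"
)
print(rollout)
```

Note that in this sketch a description only earns the visual reward if it lets the text-only pass reach the ground-truth answer, which is a check on sufficiency rather than on faithfulness to the image.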

What carries the argument

Reasoning decomposition followed by self-reprompting: the VLM generates a visual description, then receives its own output as the sole input to score visual quality.

If this is right

  • Models trained this way show fewer visual hallucinations on standard benchmarks.
  • Performance gains appear across multiple vision-language tasks without added GPU cost for external reward models.
  • The approach reduces the model's tendency to answer from text priors alone.
  • Training remains feasible on the same hardware used for ordinary fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition idea might let models self-supervise other intermediate steps such as spatial grounding or object relations.
  • If the self-reward generalizes, it could lower the barrier to applying RL to new multimodal domains that lack ready-made external verifiers.

Load-bearing premise

Reprompting the model with only its own generated visual description produces an accurate and unbiased measure of visual quality without circular dependence on language priors.

What would settle it

Run the trained model on a set of images with deliberately altered visual details and check whether the computed visual rewards still track independent human judgments of whether the description matches the altered image.
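A check of this kind could be scored as follows. The sketch assumes both the visual reward and the human judgment are binary (1 = rewarded / faithful, 0 = not); the function name and the `false_reward_rate` metric are illustrative choices, not something the paper defines.

```python
from typing import Dict, Sequence


def reward_agreement(visual_rewards: Sequence[int], human_faithful: Sequence[int]) -> Dict[str, float]:
    """Compare binary self-rewards with binary human faithfulness judgments."""
    if len(visual_rewards) != len(human_faithful):
        raise ValueError("inputs must have the same length")
    if not visual_rewards:
        raise ValueError("inputs must be non-empty")

    tp = fp = tn = fn = 0
    for reward, human in zip(visual_rewards, human_faithful):
        if reward and human:
            tp += 1  # rewarded and judged faithful
        elif reward and not human:
            fp += 1  # rewarded although judged unfaithful
        elif not reward and not human:
            tn += 1  # not rewarded, judged unfaithful
        else:
            fn += 1  # not rewarded although judged faithful

    total = tp + fp + tn + fn
    unfaithful = fp + tn
    return {
        "agreement": (tp + tn) / total,
        # Share of unfaithful descriptions that still received a self-reward.
        "false_reward_rate": fp / unfaithful if unfaithful else 0.0,
    }


# Made-up labels for six altered images, for illustration only.
stats = reward_agreement([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 0])
print(stats)
```

A high `false_reward_rate` on the altered images would suggest that the self-reward is not tracking what is actually in the picture.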

Figures

Figures reproduced from arXiv: 2508.19652 by Chengsong Huang, Dian Yu, Dong Yu, Fuxiao Liu, Haitao Mi, Jingxi Che, Jordan Boyd-Graber, Rui Liu, Wenhao Yu, Zhenwen Liang, Zongxia Li.

Figure 1: Overall framework of Vision-SR1. During RL training, the VLM performs two rollouts. In the first pass, the model takes an image–query pair and generates a structured output (visual perception, CoT reasoning, and answer), with answer reward computed against the ground truth. In the second pass, the model is re-prompted to answer using only query and its generated visual perception. If the correct answer is …
Figure 2: We prompt Qwen-2.5-VL-7B to create the SFT cold-start dataset to learn the ideal format to …

Original abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations, where they generate content that is not consistent with the visual input, and from language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision-SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning: the model is first prompted to produce self-contained visual descriptions sufficient to answer the question without referring back to the input image, before both visual and language reasoning are jointly optimized through our multi-reward loss objective. To validate this self containment, the same VLM model is reprompted to perform language reasoning using only the generated visual reasoning as input to compute visual reward. The final reward is computed through a decoupled reward-advantage framework, where visual reward and language reasoning reward each have their advantages calculated separately. Our experiments show that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks, while being more efficient than methods that rely on external visual reward models, which require additional GPUs to host. In contrast, Vision-SR1 introduces no extra GPU overhead beyond that of standard training.
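To make "advantages calculated separately" concrete, here is a minimal sketch that assumes GRPO-style group normalization (subtract the group mean, divide by the group standard deviation). The abstract does not give the formula, so the normalization, the `visual_weight` parameter, and the function names are assumptions. A coupled variant is included only for contrast.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(
    visual: Sequence[float], answer: Sequence[float], visual_weight: float = 1.0
) -> List[float]:
    """Normalize each reward stream on its own, then add the advantages."""
    adv_visual = group_advantages(visual)
    adv_answer = group_advantages(answer)
    return [a + visual_weight * v for a, v in zip(adv_answer, adv_visual)]


def coupled_advantages(
    visual: Sequence[float], answer: Sequence[float], visual_weight: float = 1.0
) -> List[float]:
    """Contrast case: add the raw rewards first, then normalize once."""
    summed = [a + visual_weight * v for a, v in zip(answer, visual)]
    return group_advantages(summed)


# Four rollouts for one prompt, with made-up binary rewards.
visual_rewards = [1.0, 0.0, 1.0, 0.0]
answer_rewards = [1.0, 1.0, 0.0, 0.0]
print(decoupled_advantages(visual_rewards, answer_rewards))
print(coupled_advantages(visual_rewards, answer_rewards))
```

In the decoupled form each reward is scaled by its own spread within the group, so a reward stream with little variance is not drowned out by the other one. Whether this matches the paper's objective would need to be checked against its method section.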

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Vision-SR1, a three-stage self-rewarding reinforcement learning framework for vision-language models. It decomposes reasoning into visual and language components by first prompting the model to generate self-contained visual descriptions sufficient to answer questions without the image, then reprompts the identical VLM with only that description to compute a visual reward signal. A decoupled reward-advantage framework separately calculates advantages for visual and language rewards before joint optimization. The central claims are that this yields improved visual reasoning, reduced hallucinations and language shortcuts across VLM tasks, and greater efficiency than external visual reward models without extra GPU overhead.

Significance. If the self-reward mechanism can be shown to provide an independent and faithful signal of visual grounding, the approach would offer a scalable, low-overhead route to post-training VLMs that directly supervises intermediate visual reasoning rather than relying solely on final-answer matching. The efficiency claim relative to methods that host separate reward models is a practical strength, and the decomposition idea directly targets a documented weakness in current VLM alignment.

major comments (2)
  1. [Method (visual reward and self-containment validation)] Method section on visual reward computation: the claim that reprompting the same VLM with only the generated visual description produces a reliable, non-circular visual reward is load-bearing for the central thesis. Because generation and reward steps share parameters and training history, any internally consistent hallucination or language prior can yield high self-reward even when the description diverges from the image; the manuscript must supply concrete validation (e.g., correlation with human visual-faithfulness judgments or an external verifier on a held-out subset) rather than relying on the decoupled advantage framework alone to mitigate this risk.
  2. [Experiments and results] Experiments section: the abstract asserts improvements in visual reasoning, hallucination mitigation, and reduced language shortcuts, yet the provided text supplies no quantitative results, baselines, ablation tables, or error analysis. To support the efficiency and effectiveness claims, the results must include direct comparisons against both standard RLHF and external-reward baselines with metrics that isolate visual grounding (e.g., visual entailment accuracy or hallucination rate on POPE-style probes).
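The POPE-style probes mentioned in major comment 2 are yes/no questions about whether an object is present, usually scored with accuracy, precision, recall, F1, and the share of "yes" answers. A minimal scoring sketch follows. The function name is hypothetical, and reporting the false-positive rate on absent objects as a "hallucination rate" is an assumption about what the referee has in mind.

```python
from typing import Dict, Sequence


def pope_style_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Score yes/no object-presence answers against ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = [(p.strip().lower() == "yes", l.strip().lower() == "yes")
             for p, l in zip(predictions, labels)]
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    tn = sum(1 for p, l in pairs if not p and not l)
    fn = sum(1 for p, l in pairs if not p and l)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
        # Assumed definition: "yes" answers for objects that are absent.
        "hallucination_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up answers for six probes, for illustration only.
metrics = pope_style_metrics(
    ["yes", "yes", "no", "yes", "no", "no"],
    ["yes", "no", "no", "yes", "yes", "no"],
)
print(metrics)
```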
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., average improvement on a visual-reasoning benchmark) to allow readers to gauge effect size without reading the full paper.
  2. [Method] Notation for the decoupled advantage framework should be introduced with an explicit equation showing how visual and language advantages are computed and combined; the current high-level description leaves the precise loss objective unclear.
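For orientation, one plausible form of the equation requested in minor comment 2 is given below. It assumes GRPO-style normalization over a group of $G$ rollouts for the same prompt. The symbols $r^{v}_i$ (visual reward), $r^{a}_i$ (language reasoning reward), the weight $\lambda$, and the stabilizer $\epsilon$ are notation introduced here, not taken from the paper.

```latex
\hat{A}^{v}_i = \frac{r^{v}_i - \operatorname{mean}\bigl(\{r^{v}_j\}_{j=1}^{G}\bigr)}
                     {\operatorname{std}\bigl(\{r^{v}_j\}_{j=1}^{G}\bigr) + \epsilon},
\qquad
\hat{A}^{a}_i = \frac{r^{a}_i - \operatorname{mean}\bigl(\{r^{a}_j\}_{j=1}^{G}\bigr)}
                     {\operatorname{std}\bigl(\{r^{a}_j\}_{j=1}^{G}\bigr) + \epsilon},
\qquad
\hat{A}_i = \hat{A}^{a}_i + \lambda\,\hat{A}^{v}_i
```

The point of writing it this way is that each reward is normalized against its own group statistics before the two are combined. The paper's actual objective may differ in how the terms are weighted or applied to tokens.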

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the visual reward mechanism and the presentation of experimental results. We address each major comment below and outline the revisions we will make.

  1. Referee: [Method (visual reward and self-containment validation)] Method section on visual reward computation: the claim that reprompting the same VLM with only the generated visual description produces a reliable, non-circular visual reward is load-bearing for the central thesis. Because generation and reward steps share parameters and training history, any internally consistent hallucination or language prior can yield high self-reward even when the description diverges from the image; the manuscript must supply concrete validation (e.g., correlation with human visual-faithfulness judgments or an external verifier on a held-out subset) rather than relying on the decoupled advantage framework alone to mitigate this risk.

    Authors: We agree that explicit validation of the visual reward's faithfulness is essential to support the central claims. While the decoupled reward-advantage framework is designed to isolate visual and language contributions during optimization, it does not by itself prove that the self-reward signal is non-circular or grounded in the image. In the revised manuscript we will add a dedicated validation subsection that reports (1) Pearson correlation between the computed visual rewards and human annotations of visual faithfulness on a held-out subset of 500 examples, and (2) agreement rates with an external vision-language verifier on the same subset. These additions will be placed in the method and experiments sections to directly address the concern. revision: yes

  2. Referee: [Experiments and results] Experiments section: the abstract asserts improvements in visual reasoning, hallucination mitigation, and reduced language shortcuts, yet the provided text supplies no quantitative results, baselines, ablation tables, or error analysis. To support the efficiency and effectiveness claims, the results must include direct comparisons against both standard RLHF and external-reward baselines with metrics that isolate visual grounding (e.g., visual entailment accuracy or hallucination rate on POPE-style probes).

    Authors: We acknowledge that the initial submission did not present the quantitative results with sufficient prominence or detail. The full manuscript contains experimental evaluations, but we will substantially expand the experiments section in the revision. We will add tables comparing Vision-SR1 against standard RLHF and external-reward baselines on multiple VLM benchmarks, reporting visual-grounding-specific metrics including visual entailment accuracy and hallucination rates measured via POPE-style probes. Ablation studies isolating the visual-reward component and error analysis of remaining failure cases will also be included to substantiate the efficiency and effectiveness claims. revision: yes
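The Pearson correlation promised in the first response can be computed without any dependencies. The sketch below is illustrative only: the function name and the sample numbers are made up, and no results from the paper are reproduced.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Made-up data: binary visual rewards vs. human faithfulness ratings on a 1-5 scale.
visual_rewards = [1, 0, 1, 1, 0, 0]
human_ratings = [5, 2, 4, 5, 1, 3]
print(round(pearson(visual_rewards, human_ratings), 3))
```

If both the reward and the human label are binary, this coefficient coincides with the phi coefficient, so it is worth reporting the raw agreement counts alongside it.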

Circularity Check

1 step flagged

Self-reward via reprompting same VLM creates internal dependence for visual signal

specific steps
  1. self-definitional [Abstract]
    "To validate this self containment, the same VLM model is reprompted to perform language reasoning using only the generated visual reasoning as input to compute visual reward."

    The visual reward used to supervise the model is generated by feeding the model's own visual description back into the same VLM (no image, no external model). This makes the reward definitionally dependent on the model's internal language priors and consistency, so any systematic hallucination that is internally coherent will receive high self-reward, reducing the claimed independent validation of visual grounding to a self-referential loop.

full rationale

The central mechanism defines the visual reward by reprompting the identical model on its own generated description, which is load-bearing for the claim of reliable self-contained visual grounding without external supervision. This step is quoted directly from the abstract and matches the self-definitional pattern because the reward signal is constructed from the model's output rather than an independent verifier. However, the paper still reports external task benchmarks and efficiency gains, so the circularity is partial rather than total reduction of the entire result to its inputs. No equations or self-citations are shown to force the outcome by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full paper would be needed to enumerate all free parameters and axioms. The central method rests on the assumption that self-generated visual descriptions can serve as a faithful proxy for visual reward.

axioms (1)
  • domain assumption Generated visual descriptions are sufficient to answer the question without referring back to the input image.
    This premise is required for the reprompting step to compute visual reward.

pith-pipeline@v0.9.0 · 5832 in / 1251 out tokens · 40888 ms · 2026-05-18T21:00:15.498409+00:00 · methodology

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  2. Visual-Advantage On-Policy Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.

  3. CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

  4. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  5. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  6. VIDEOP2R: Video Understanding from Perception to Reasoning

    cs.CV 2025-11 conditional novelty 7.0

    VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.

  7. DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

    cs.AI 2025-09 unverdicted novelty 7.0

    DeFacto trains multimodal models using counterfactual image variants and reinforcement learning rewards to improve both answer accuracy and evidence-answer consistency.

  8. EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.

  9. RISE: Reliable Improvement in Self-Evolving Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    RISE is a self-evolving framework for VLMs that adds fine-grained alternation, quality supervision, and dynamic balancing to produce reliable gains on seven benchmarks from unlabeled data.

  10. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

  11. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  12. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  13. Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?

    cs.AI 2026-05 unverdicted novelty 6.0

    Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.

  14. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  15. VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

    cs.CV 2026-04 conditional novelty 6.0

    VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...

  16. SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 6.0

    SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.

  17. Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

    cs.CV 2026-03 unverdicted novelty 6.0

    REFORM is a three-stage reasoning curriculum plus the ROM dataset that achieves state-of-the-art generalization on multimodal manipulation detection benchmarks.

  18. DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

    cs.AI 2025-09 unverdicted novelty 6.0

    DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.

  19. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  20. Semantic-Enriched Latent Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.

  21. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
