Self-Rewarding Vision-Language Model via Reasoning Decomposition
Pith reviewed 2026-05-18 21:00 UTC · model grok-4.3
The pith
Vision-language models can improve their own visual reasoning by generating self-contained descriptions and rewarding themselves without external supervisors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-SR1 decomposes VLM reasoning into visual reasoning and language reasoning components. The model is first prompted to produce a self-contained visual description that suffices to answer the question without the original image. The same model is then reprompted with only that description to compute a visual reward, which is combined with a language reasoning reward through a decoupled reward-advantage framework. According to the paper, this self-rewarding loop provides denser visual supervision during post-training and improves performance on vision-language tasks.
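The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `rollout` and `answer_from_text` are hypothetical callables standing in for the same VLM with and without image access, and exact-match scoring is an assumed stand-in for whatever answer check the paper uses.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Rollout:        (question, image_bytes) -> (visual_description, final_answer)
#   AnswerFromText: (question, visual_description) -> answer, with no image access
Rollout = Callable[[str, bytes], Tuple[str, str]]
AnswerFromText = Callable[[str, str], str]


def exact_match(prediction: str, reference: str) -> float:
    """Assumed answer check: 1.0 on a case-insensitive exact match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def self_rewards(
    rollout: Rollout,
    answer_from_text: AnswerFromText,
    question: str,
    image: bytes,
    reference: str,
) -> Dict[str, float]:
    # Step 1: with the image, the model writes a description and a final answer.
    description, answer = rollout(question, image)
    # Step 2: the same model is reprompted with the description only.
    blind_answer = answer_from_text(question, description)
    # Step 3: the blind answer scores the description; the final answer is scored as usual.
    return {
        "visual_reward": exact_match(blind_answer, reference),
        "answer_reward": exact_match(answer, reference),
    }
```

In this sketch a description earns visual reward only when it carries enough information for the text-only pass to reach the reference answer, which is the self-containment property the core claim relies on.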
What carries the argument
Reasoning decomposition followed by self-reprompting: the VLM generates a visual description, then receives its own output as the sole input to score visual quality.
If this is right
- Models trained this way show fewer visual hallucinations on standard benchmarks.
- Performance gains appear across multiple vision-language tasks without added GPU cost for external reward models.
- The approach reduces the model's tendency to answer from text priors alone.
- Training remains feasible on the same hardware used for ordinary fine-tuning.
Where Pith is reading between the lines
- The same decomposition idea might let models self-supervise other intermediate steps such as spatial grounding or object relations.
- If the self-reward generalizes, it could lower the barrier to applying RL to new multimodal domains that lack ready-made external verifiers.
Load-bearing premise
Reprompting the model with only its own generated visual description produces an accurate and unbiased measure of visual quality without circular dependence on language priors.
What would settle it
Run the trained model on a set of images with deliberately altered visual details and check whether the computed visual rewards still track independent human judgments of whether the description matches the altered image.
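One way to score such a test is sketched below, assuming each altered image yields a model-computed visual reward and a binary human judgment (1 if the description matches the altered image, 0 otherwise). The function name and the choice of a pairwise AUC are illustrative assumptions; the source does not specify a metric.

```python
from typing import Sequence


def pairwise_auc(rewards: Sequence[float], human_labels: Sequence[int]) -> float:
    """Probability that a human-approved description outscores a rejected one.

    1.0 means the reward ranks every faithful description above every
    unfaithful one; 0.5 means the reward carries no ranking information.
    """
    faithful = [r for r, y in zip(rewards, human_labels) if y == 1]
    unfaithful = [r for r, y in zip(rewards, human_labels) if y == 0]
    if not faithful or not unfaithful:
        raise ValueError("need at least one faithful and one unfaithful example")
    # Count wins over all (faithful, unfaithful) pairs; ties count as half a win.
    wins = sum(
        1.0 if f > u else 0.5 if f == u else 0.0
        for f in faithful
        for u in unfaithful
    )
    return wins / (len(faithful) * len(unfaithful))
```

A value close to 0.5 on the altered-image set would suggest that the self-computed reward does not track human judgments of faithfulness.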
Original abstract
Vision-Language Models (VLMs) often suffer from visual hallucinations, generating content that is inconsistent with the visual input, and from language shortcuts, where they skip the visual part and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision-SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning: the model is first prompted to produce self-contained visual descriptions sufficient to answer the question without referring back to the input image, before jointly optimizing both visual and language reasoning through our multi-reward loss objective. To validate this self containment, the same VLM model is reprompted to perform language reasoning using only the generated visual reasoning as input to compute visual reward. The final reward is computed through a decoupled reward-advantage framework, where visual reward and language reasoning reward each have their advantages calculated separately. Our experiments show that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks, while being more efficient than methods that rely on external visual reward models, which require additional GPUs to host. In contrast, Vision-SR1 introduces no extra GPU overhead beyond that of standard training.
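The abstract says only that the two rewards have their advantages calculated separately. The sketch below shows one plausible reading, assuming GRPO-style normalization within a group of rollouts for the same prompt. The normalization scheme, the function names, and the `eps` constant are assumptions rather than details from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(
    visual_rewards: Sequence[float],
    answer_rewards: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Compute one advantage per reward stream instead of normalizing their sum."""
    return group_advantages(visual_rewards), group_advantages(answer_rewards)
```

Under this reading, each stream is centered and scaled on its own, so a reward with little variance in a group is not drowned out by one with more. How the two advantages are applied to the visual and language tokens is not stated in the text available here.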
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Vision-SR1, a three-stage self-rewarding reinforcement learning framework for vision-language models. It decomposes reasoning into visual and language components by first prompting the model to generate self-contained visual descriptions sufficient to answer questions without the image, then reprompting the identical VLM with only that description to compute a visual reward signal. A decoupled reward-advantage framework calculates advantages for the visual and language rewards separately before joint optimization. The central claims are that this yields improved visual reasoning, fewer hallucinations and language shortcuts across vision-language tasks, and greater efficiency than external visual reward models, with no extra GPU overhead.
Significance. If the self-reward mechanism can be shown to provide an independent and faithful signal of visual grounding, the approach would offer a scalable, low-overhead route to post-training VLMs that directly supervises intermediate visual reasoning rather than relying solely on final-answer matching. The efficiency claim relative to methods that host separate reward models is a practical strength, and the decomposition idea directly targets a documented weakness in current VLM alignment.
major comments (2)
- [Method (visual reward and self-containment validation)] Method section on visual reward computation: the claim that reprompting the same VLM with only the generated visual description produces a reliable, non-circular visual reward is load-bearing for the central thesis. Because generation and reward steps share parameters and training history, any internally consistent hallucination or language prior can yield high self-reward even when the description diverges from the image; the manuscript must supply concrete validation (e.g., correlation with human visual-faithfulness judgments or an external verifier on a held-out subset) rather than relying on the decoupled advantage framework alone to mitigate this risk.
- [Experiments and results] Experiments section: the abstract asserts improvements in visual reasoning, hallucination mitigation, and reduced language shortcuts, yet the provided text supplies no quantitative results, baselines, ablation tables, or error analysis. To support the efficiency and effectiveness claims, the results must include direct comparisons against both standard RLHF and external-reward baselines with metrics that isolate visual grounding (e.g., visual entailment accuracy or hallucination rate on POPE-style probes).
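The grounding metrics requested in the second major comment could be computed along these lines. This sketch assumes yes/no object-presence probes in the style of POPE, and it treats the share of absent-object probes answered "yes" as the hallucination rate. That definition and the function name are assumptions for illustration only.

```python
from typing import Dict, Sequence


def pope_style_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Score yes/no probes; labels are 'yes' if the object is present, else 'no'."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    pairs = [(p.strip().lower(), y.strip().lower()) for p, y in zip(predictions, labels)]
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)   # object claimed but absent
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Assumed definition: fraction of absent-object probes answered "yes".
        "hallucination_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "yes_ratio": (tp + fp) / len(pairs),
    }
```

Reporting the yes ratio alongside accuracy helps distinguish a model that is well grounded from one that simply answers "yes" most of the time.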
minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., average improvement on a visual-reasoning benchmark) to allow readers to gauge effect size without reading the full paper.
- [Method] Notation for the decoupled advantage framework should be introduced with an explicit equation showing how visual and language advantages are computed and combined; the current high-level description leaves the precise loss objective unclear.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the visual reward mechanism and the presentation of experimental results. We address each major comment below and outline the revisions we will make.
-
Referee: [Method (visual reward and self-containment validation)] Method section on visual reward computation: the claim that reprompting the same VLM with only the generated visual description produces a reliable, non-circular visual reward is load-bearing for the central thesis. Because generation and reward steps share parameters and training history, any internally consistent hallucination or language prior can yield high self-reward even when the description diverges from the image; the manuscript must supply concrete validation (e.g., correlation with human visual-faithfulness judgments or an external verifier on a held-out subset) rather than relying on the decoupled advantage framework alone to mitigate this risk.
Authors: We agree that explicit validation of the visual reward's faithfulness is essential to support the central claims. While the decoupled reward-advantage framework is designed to isolate visual and language contributions during optimization, it does not by itself prove that the self-reward signal is non-circular or grounded in the image. In the revised manuscript we will add a dedicated validation subsection that reports (1) Pearson correlation between the computed visual rewards and human annotations of visual faithfulness on a held-out subset of 500 examples, and (2) agreement rates with an external vision-language verifier on the same subset. These additions will be placed in the method and experiments sections to directly address the concern.
revision: yes
-
Referee: [Experiments and results] Experiments section: the abstract asserts improvements in visual reasoning, hallucination mitigation, and reduced language shortcuts, yet the provided text supplies no quantitative results, baselines, ablation tables, or error analysis. To support the efficiency and effectiveness claims, the results must include direct comparisons against both standard RLHF and external-reward baselines with metrics that isolate visual grounding (e.g., visual entailment accuracy or hallucination rate on POPE-style probes).
Authors: We acknowledge that the initial submission did not present the quantitative results with sufficient prominence or detail. The full manuscript contains experimental evaluations, but we will substantially expand the experiments section in the revision. We will add tables comparing Vision SR1 against standard RLHF and external-reward baselines on multiple VLM benchmarks, reporting visual-grounding-specific metrics including visual entailment accuracy and hallucination rates measured via POPE-style probes. Ablation studies isolating the visual-reward component and error analysis of remaining failure cases will also be included to substantiate the efficiency and effectiveness claims.
revision: yes
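The two validation statistics promised in the first response (Pearson correlation with human annotations and agreement with an external verifier) could be computed roughly as follows. This is a dependency-free sketch; the 0.5 threshold used to binarize rewards and both function names are assumptions, not details from the rebuttal.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between visual rewards and human faithfulness scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def agreement_rate(
    rewards: Sequence[float],
    verifier_labels: Sequence[int],
    threshold: float = 0.5,  # assumed cut-off for calling a description faithful
) -> float:
    """Fraction of examples where the thresholded reward matches the verifier."""
    if not rewards or len(rewards) != len(verifier_labels):
        raise ValueError("rewards and verifier_labels must be non-empty and equal in length")
    hits = sum((r >= threshold) == bool(v) for r, v in zip(rewards, verifier_labels))
    return hits / len(rewards)
```

If the visual reward is binary, the Pearson value against binary human labels reduces to the phi coefficient, so reporting the underlying 2x2 counts alongside it would make the result easier to interpret.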
Circularity Check
Self-reward via reprompting same VLM creates internal dependence for visual signal
specific steps
-
self-definitional
[Abstract]
"To validate this self containment, the same VLM model is reprompted to perform language reasoning using only the generated visual reasoning as input to compute visual reward."
The visual reward used to supervise the model is generated by feeding the model's own visual description back into the same VLM (no image, no external model). This makes the reward definitionally dependent on the model's internal language priors and consistency, so any systematic hallucination that is internally coherent will receive high self-reward, reducing the claimed independent validation of visual grounding to a self-referential loop.
full rationale
The central mechanism defines the visual reward by reprompting the identical model on its own generated description, which is load-bearing for the claim of reliable self-contained visual grounding without external supervision. This step is quoted directly from the abstract and matches the self-definitional pattern because the reward signal is constructed from the model's output rather than from an independent verifier. However, the paper still reports external task benchmarks and efficiency gains, so the circularity is partial: the result as a whole does not reduce to its own inputs. No equations or self-citations are shown to force the outcome by construction.
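The concern can be made concrete with a toy example. In the sketch below the reward function has no image parameter, so a description that misreports the image can still earn full reward as long as the text-only pass reaches the reference answer. Every name here is hypothetical, and `read_sign` is a deliberately trivial stand-in for the reprompted model.

```python
def description_only_reward(answer_from_text, question, description, reference):
    """Reward computed from text alone; note that no image is passed in."""
    prediction = answer_from_text(question, description)
    return float(prediction.strip().lower() == reference.strip().lower())


def read_sign(question, description):
    """Toy stand-in for the reprompted model: returns the description's last word."""
    return description.rstrip(".").split()[-1]


question = "What does the sign say?"
# Suppose the image shows a red octagonal sign.
faithful = "A red octagonal sign that reads STOP."
unfaithful = "A blue square sign that reads STOP."  # wrong about color and shape

reward_faithful = description_only_reward(read_sign, question, faithful, "stop")
reward_unfaithful = description_only_reward(read_sign, question, unfaithful, "stop")
```

Both descriptions receive the same reward in this toy setup, which illustrates why the check measures whether a description is sufficient to answer the question rather than whether it is faithful to the image.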
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Generated visual descriptions are sufficient to answer the question without referring back to the input image.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning... the same VLM model is then re-prompted to perform language reasoning using only the generated perception as input to compute reward.
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
We instead optimize a joint objective $J(\theta) = \mathbb{E}_{s \sim \pi_\theta}\left[ r_{\text{visual}}(c, x) + r_{\text{ans}}(a, a^{*}) \right]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
Visual-Advantage On-Policy Distillation for Vision-Language Models
VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
-
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
-
Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
VIDEOP2R: Video Understanding from Perception to Reasoning
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
-
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
DeFacto trains multimodal models using counterfactual image variants and reinforcement learning rewards to improve both answer accuracy and evidence-answer consistency.
-
EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.
-
RISE: Reliable Improvement in Self-Evolving Vision-Language Models
RISE is a self-evolving framework for VLMs that adds fine-grained alternation, quality supervision, and dynamic balancing to produce reliable gains on seven benchmarks from unlabeled data.
-
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
-
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?
Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...
-
SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models
SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.
-
Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
REFORM is a three-stage reasoning curriculum plus the ROM dataset that achieves state-of-the-art generalization on multimodal manipulation detection benchmarks.
-
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
Semantic-Enriched Latent Visual Reasoning
SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...