Visually-Guided Policy Optimization for Multimodal Reasoning

Feng Xiong; Liang Lin; Man Zhang; Xiangxiang Chu; Xuecai Hu; Yanlin Wang; Yong Wang; Zengbin Wang

arxiv: 2604.09349 · v2 · pith:X2JCNQCInew · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.CL

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords visual attention compensation · policy optimization · multimodal reasoning · vision-language models · reinforcement learning · visual forgetting · advantage re-weighting

The pith

Visually-Guided Policy Optimization counters visual forgetting in vision-language models by amplifying attention to image tokens during reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard reinforcement learning with verifiable rewards leaves vision-language models with weak attention to visual inputs and progressive loss of visual information across reasoning steps. It introduces Visually-Guided Policy Optimization to fix this by first using visual similarity to boost relevant image cues and raise visual expectations later in a sequence. A dual re-weighting step then favors tokens and full trajectories that show stronger visual engagement. If correct, the method produces measurably denser visual activation and higher accuracy on tasks that require both images and math reasoning. The central mechanism is therefore the combination of similarity-driven compensation and visual-accumulation-based advantage adjustment.

Core claim

VGPO introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues while progressively elevating visual expectations in later reasoning steps; it then applies a dual-grained advantage re-weighting strategy that highlights high-visual-activation tokens within trajectories and prioritizes trajectories with superior visual accumulation overall, yielding improved visual faithfulness and performance on mathematical multimodal reasoning and visual-dependent tasks.

What carries the argument

Visually-Guided Policy Optimization (VGPO) framework, built on Visual Attention Compensation that uses similarity to boost image cues and dual-grained (intra- and inter-trajectory) advantage re-weighting by visual activation.

If this is right

Models trained with VGPO maintain higher visual token activation across multi-step reasoning chains.
Accuracy rises on benchmarks that combine images with mathematical or visual reasoning demands.
The intra-trajectory re-weighting favors individual tokens with stronger visual signals inside any given rollout.
The inter-trajectory re-weighting selects entire reasoning paths that accumulate more visual evidence.
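
The first consequence above implies a measurable quantity: the share of each generated token's attention that lands on image tokens, tracked along the reasoning chain. The following is a minimal sketch of such a measurement, not the paper's evaluation code. It assumes attention rows have already been averaged over heads and layers (the source does not say how attention is aggregated), and the names `visual_attention_ratio` and `ratio_by_segment` are hypothetical.

```python
from typing import List, Sequence, Set


def visual_attention_ratio(attn_row: Sequence[float], image_positions: Set[int]) -> float:
    """Fraction of one token's attention mass that lands on image-token positions."""
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    visual = sum(a for j, a in enumerate(attn_row) if j in image_positions)
    return visual / total


def ratio_by_segment(
    attn_rows: Sequence[Sequence[float]],
    image_positions: Set[int],
    n_segments: int,
) -> List[float]:
    """Mean visual attention ratio over n_segments consecutive slices of the generation.

    ASSUMPTION: attn_rows[t] is the (head- and layer-averaged) attention of the
    t-th generated token over all earlier positions; rows may differ in length.
    """
    ratios = [visual_attention_ratio(row, image_positions) for row in attn_rows]
    n_tokens = len(ratios)
    means = []
    for s in range(n_segments):
        lo = s * n_tokens // n_segments
        hi = (s + 1) * n_tokens // n_segments
        chunk = ratios[lo:hi]
        means.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return means
```

A ratio that falls from early segments to late ones would be consistent with the visual forgetting the paper describes; a flatter curve after training would be consistent with the claimed effect.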

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compensation and re-weighting logic could be tested on non-mathematical visual tasks such as chart interpretation or diagram-based question answering.
If visual accumulation proves to be the key signal, similar re-weighting might be applied to other modalities like audio or video in future multimodal agents.
The approach assumes visual similarity remains a reliable proxy for relevance; a direct test would be to replace it with learned relevance scores and measure the difference.

Load-bearing premise

Amplifying visual cues through similarity-based compensation and re-weighting trajectories by visual accumulation will reliably counteract forgetting without introducing new biases or degrading text-based reasoning quality.

What would settle it

Running the same base model with and without VGPO on a visual-dependent math reasoning benchmark and finding no measurable rise in attention scores on image tokens or no accuracy improvement.
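
The check described above is a paired comparison on the same benchmark items. One way to decide whether an accuracy gain is "measurable" is a paired bootstrap over per-item correctness. This is a sketch of that generic statistical technique, not something the paper reports; the function name, the resample count, and the 95% interval are illustrative assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    base: List[int],
    treated: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy gain, 2.5th percentile, 97.5th percentile).

    base[k] and treated[k] are 0/1 correctness flags for the same item k,
    scored without and with the method under test.
    """
    if len(base) != len(treated) or not base:
        raise ValueError("need two non-empty, equal-length 0/1 lists")
    rng = random.Random(seed)
    n = len(base)
    diffs = [t - b for b, t in zip(base, treated)]
    observed = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each item's pair intact.
        samples.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    samples.sort()
    lo = samples[int(0.025 * n_resamples)]
    hi = samples[int(0.975 * n_resamples) - 1]
    return observed, lo, hi
```

If the interval for the gain includes zero and attention on image tokens does not rise either, the outcome matches the falsifying scenario described above.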

Figures

Figures reproduced from arXiv: 2604.09349 by Feng Xiong, Liang Lin, Man Zhang, Xiangxiang Chu, Xuecai Hu, Yanlin Wang, Yong Wang, Zengbin Wang.

**Figure 2.** Analysis of the inference nature of multimodal reasoning trajectory (based on Qwen2.5-VL-7B (…

**Figure 3.** Overview of Visually-Guided Policy Optimization framework. Given query and image, (a) VGPO firstly utilizes the intrinsic hidden state similarity between generated tokens and visual prototype to derive a Visual Focus Score for visual token localization. (b) Then, Visual Attention Compensation (VAC) mechanism leverages this score to re-focus visual tokens, while progressively elevating visual expectations a…

**Figure 4.** Training dynamics based on Qwen2.5-VL-7B: (a) training rewards and (b) validation accuracy on MMK12 (Meng et al., 2025) across GRPO (Shao et al., 2024), DAPO (Yu et al., 2025), and our VGPO.

… LogicVista (Xiao et al., 2024), SuperClevr Counting (Li et al., 2023), MMMU-Pro (Yue et al., 2025), and MathVerse-V (Zhang et al., 2024b). Specifically, we adopt these datasets from PAPOEval (Wang et al., 2025e), whi…

**Figure 5.** Ablation study of training dynamics of the…

**Figure 6.** Comparison of the vision attention ratio dis…

**Figure 8.** Training dynamics of Qwen2.5-VL-32B: (a) training rewards and (b) validation accuracy on MMK12 (Meng et al., 2025) across DAPO (Yu et al., 2025), and our VGPO.

… suggests that our visual-guided strategy effectively reduces variance and stabilizes the optimization process for large-scale models. Consistent with the rewards, …

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VGPO adds a visual attention compensation step and dual-grained re-weighting to RLVR for VLMs, but the abstract supplies no numbers or controls to show it works without new biases.

The main takeaway is that this paper describes VGPO, which tries to keep VLMs from drifting away from visual tokens during multi-step reasoning. It does this by first using visual similarity to boost attention on relevant image parts and raise expectations in later steps, then re-weighting advantages both inside a trajectory (favoring high-visual tokens) and across trajectories (favoring those with more visual accumulation overall).

Referee Report

2 major / 0 minor

Summary. The paper proposes Visually-Guided Policy Optimization (VGPO) for vision-language models under RLVR. It identifies insufficient visual faithfulness and temporal visual forgetting as key issues, then introduces a Visual Attention Compensation mechanism that uses visual similarity to localize/amplify cues and raise expectations in later reasoning steps. This is combined with dual-grained advantage re-weighting (intra-trajectory on high visual activation tokens; inter-trajectory on high visual accumulation trajectories) to reinforce visual focus. The abstract states that extensive experiments show improved visual activation and superior performance on mathematical multimodal reasoning and visual-dependent tasks.

Significance. If the claimed gains are reproducible and the re-weighting does not trade off text reasoning quality, the framework could offer a targeted way to mitigate visual forgetting in VLMs, strengthening multimodal reasoning pipelines that currently default to text dominance.

major comments (2)

[Abstract] The central empirical claim ('VGPO achieves better visual activation and superior performance...') is presented without any quantitative results, baselines, ablations, error bars, or dataset statistics. This absence is load-bearing because the method's value rests entirely on the assertion that the proposed compensation and re-weighting produce measurable gains.
[Method] No analysis or measurement is supplied showing that similarity-based compensation and the dual re-weighting preserve (or improve) text-only reasoning quality, nor that the similarity metric remains robust under visual variations. Without such controls the risk that re-weighting over-prioritizes visually salient but semantically secondary tokens cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate quantitative highlights in the abstract and additional analyses in the method/experiments sections.

Referee: [Abstract] The central empirical claim ('VGPO achieves better visual activation and superior performance...') is presented without any quantitative results, baselines, ablations, error bars, or dataset statistics. This absence is load-bearing because the method's value rests entirely on the assertion that the proposed compensation and re-weighting produce measurable gains.

Authors: We agree that the abstract would be strengthened by including quantitative support. In the revised manuscript, we will add specific performance gains (e.g., accuracy improvements on MathVista and visual reasoning benchmarks versus baselines), a brief reference to ablations, and key dataset details. Full error bars and statistics will remain in the experimental section but will be summarized concisely in the abstract.
Revision: yes

Referee: [Method] No analysis or measurement is supplied showing that similarity-based compensation and the dual re-weighting preserve (or improve) text-only reasoning quality, nor that the similarity metric remains robust under visual variations. Without such controls the risk that re-weighting over-prioritizes visually salient but semantically secondary tokens cannot be assessed.

Authors: We will add a new analysis subsection demonstrating that VGPO preserves text-only reasoning quality via evaluations on pure-text benchmarks (showing no degradation or slight gains). We will also include robustness tests of the similarity metric under visual variations (e.g., augmentations and perturbations) and discuss how re-weighting correlates with semantic relevance to mitigate the noted risk. These results will be incorporated into the revised manuscript.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal with independent validation

The paper describes an empirical observation of visual forgetting in VLMs, then introduces VGPO with Visual Attention Compensation (using similarity to amplify cues) and dual re-weighting (intra/inter-trajectory by visual activation/accumulation). These are presented as novel mechanisms motivated by the observation, followed by experimental results on performance. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce any claim to its own inputs by construction. The central claims rest on the proposed framework and external experiments rather than definitional equivalence or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; full text would be required to audit these.

pith-pipeline@v0.9.0 · 5737 in / 895 out tokens · 32691 ms · 2026-05-25T06:39:30.980463+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

Visual Focus Score $\rho_{i,t}$ via cosine similarity $S(h_{i,t}, \mu_v)$ between hidden state and visual prototype; $w_{i,t} = \rho_{i,t} \cdot [1 + G_i(\rho_{i,t}) \cdot \beta \cdot t / T_i]$; dual-grained re-weighting $\psi_{i,t}$, $\phi_i$ yielding $\hat{A}^{V}_{i,t} = \hat{A}_i \cdot (1 + \psi_{i,t}) \cdot (1 + \phi_i)$
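
The passage above compresses three formulas into one line. The sketch below shows how they might fit together; it is an illustration under stated assumptions, not the released VGPO code. The source does not define the gate $G_i$, nor how $\psi_{i,t}$ and $\phi_i$ are computed, so the choices here are hypothetical placeholders: the gate fires when $\rho_{i,t}$ is at or above the trajectory mean, $\psi_{i,t}$ is the clipped relative excess of $w_{i,t}$ over the trajectory mean, and $\phi_i$ is the clipped relative excess of the trajectory mean over the group mean. All function names and the value of `beta` are likewise assumptions.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity S(u, v); returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def compensated_weights(hidden: List[List[float]], mu_v: List[float], beta: float) -> List[float]:
    """w_t = rho_t * [1 + G(rho_t) * beta * t / T] for one non-empty trajectory.

    ASSUMPTION: G(rho_t) = 1 if rho_t >= mean(rho) over the trajectory, else 0.
    Steps are counted from t = 1, so the boost grows toward the end.
    """
    rho = [cosine(h, mu_v) for h in hidden]
    n_steps = len(rho)
    mean_rho = sum(rho) / n_steps
    return [
        r * (1.0 + (1.0 if r >= mean_rho else 0.0) * beta * (t + 1) / n_steps)
        for t, r in enumerate(rho)
    ]


def reweighted_advantages(adv: List[float], weights: List[List[float]]) -> List[List[float]]:
    """A^V_{i,t} = A_i * (1 + psi_{i,t}) * (1 + phi_i) for a group of trajectories.

    ASSUMPTION: psi and phi are clipped relative excesses over the trajectory
    mean and the group mean of w, respectively (placeholders, not from the source).
    """
    traj_means = [sum(w) / len(w) for w in weights]
    group_mean = sum(traj_means) / len(traj_means)
    out = []
    for a_i, w_i, m_i in zip(adv, weights, traj_means):
        phi = max(0.0, (m_i - group_mean) / group_mean) if group_mean > 0 else 0.0
        row = []
        for w in w_i:
            psi = max(0.0, (w - m_i) / m_i) if m_i > 0 else 0.0
            row.append(a_i * (1.0 + psi) * (1.0 + phi))
        out.append(row)
    return out
```

Because both factors are at least 1 under these placeholder definitions, the sign of each trajectory's advantage is preserved; only its magnitude changes, for negative advantages as well as positive ones.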

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

temporal visual forgetting mitigation via progressive elevation of visual expectations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Structured Role-Aware Policy Optimization for Multimodal Reasoning
cs.AI · 2026-05 · unverdicted · novelty 7.0

SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Hallucination of Multimodal Large Language Models: A Survey

Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930. Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. 2024a. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In Forty-first International Conference on ...

[2] Reasoning with Exploration: An Entropy Perspective

Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758. Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. 2025. Gpg: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546. Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderj...

[3] SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

Multi-modal hallucination control by visual information grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14303–14312. Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see ...

[4] MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, and 1 others. 2025. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365. Runqi...

[5] LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025a. Qwen3 technical report. arXiv preprint arXiv:2505.09388. Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, ...

[6] By visually inspecting the image, it appears there are at least 10 vertical bars

Count the vertical bars: The image shows a series of vertical bars that form a fence. By visually inspecting the image, it appears there are at least 10 vertical bars. (visual counting error)

[7] There are 10 vertical bars

Count the elements: There is one cow visible in the image. There are 10 vertical bars

[8] Therefore, the total number of elements is: \boxed{11}

Calculate the total number of elements: Total elements = 10 + 1 = 11. Therefore, the total number of elements is: \boxed{11}. Failure Analysis: The model’s prediction fails due to an Object Counting Error. It correctly identifies the task of counting bars and adding the cow, but falters on the initial visual perception step, incorrectly counting 10 vertic...
