arxiv: 2604.01840 · v2 · submitted 2026-04-02 · 💻 cs.AI

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin

Pith reviewed 2026-05-13 21:18 UTC · model grok-4.3

classification 💻 cs.AI

keywords: perception-grounded policy optimization · token visual dependency · multimodal reasoning · large vision-language models · reinforcement learning from verifiable rewards · credit assignment · gradient variance reduction

The pith

Perception-Grounded Policy Optimization reshapes token advantages by their visual dependency, lifting multimodal reasoning performance 18.7 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard reinforcement learning from verifiable rewards assigns the same advantage to every token in a generated sequence. This dilutes the learning signal for the sparse subset of tokens whose correctness actually depends on the image. The paper defines Token Visual Dependency as the KL divergence between a model’s next-token distribution when the image is present versus when it is replaced by text-only conditioning. It then introduces PGPO, a threshold-gated mass-conserving reshaper that amplifies advantages for high-dependency tokens while suppressing gradient noise from linguistic priors. Experiments on the Qwen2.5-VL family across seven multimodal reasoning benchmarks show consistent gains, lower gradient variance, and reduced collapse risk.

Core claim

Perception-Grounded Policy Optimization (PGPO) is a token-level credit assignment method that quantifies each token’s causal dependence on visual input via KL divergence between visual-conditioned and text-only predictive distributions and then applies a threshold-gated, mass-conserving reshape to the advantage vector, thereby concentrating policy gradient updates on perception-grounded reasoning steps.

What carries the argument

Token Visual Dependency, computed as the Kullback-Leibler divergence between the model’s visual-conditioned and text-only next-token distributions, which acts as a sparse per-token mask to dynamically reshape policy advantages.
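The dependency score described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`softmax`, `kl_divergence`, `token_visual_dependency`), the toy logits, and the KL direction (visual-conditioned distribution first, text-only second) are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token score: KL between visual-conditioned and text-only predictions.

    Both arguments are lists with one logit vector per generated token.
    The KL direction is an assumption of this sketch.
    """
    return [
        kl_divergence(softmax(lv), softmax(lt))
        for lv, lt in zip(logits_with_image, logits_text_only)
    ]

# Hypothetical toy case: a 3-word vocabulary and two generated tokens.
# Token 0 is predicted identically with and without the image;
# token 1 flips its preferred word once the image is available.
with_image = [[1.0, 0.5, 0.1], [4.0, 0.0, 0.0]]
text_only = [[1.0, 0.5, 0.1], [0.0, 0.0, 4.0]]
scores = token_visual_dependency(with_image, text_only)
print(scores)
```

In this toy case the first score is zero and the second is large, which is the sparsity pattern the review describes: only tokens whose prediction changes when the image is present receive a high score.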

If this is right

Gradient variance drops because noise from text-only tokens is suppressed.
Training stability improves and collapse is avoided without extra regularization terms.
Multimodal reasoning accuracy rises across diverse benchmarks while using the identical base model and reward.
The method functions as an implicit regularizer that favors solutions grounded in perception over pure linguistic shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same dependency measure could be applied to audio or other sensory conditioning by swapping the visual input for the appropriate modality.
Threshold selection may be task-dependent; a learned gating network could replace the fixed threshold as a direct extension.
PGPO could be combined with external-tool or chain-of-thought verification to further isolate tokens that require non-linguistic information.

Load-bearing premise

The KL divergence between visual and text-only distributions reliably isolates tokens whose correctness depends on seeing the image rather than on language statistics alone.

What would settle it

Run the same training loop on a purely linguistic reasoning dataset with no images; if PGPO still improves or maintains performance, the visual-dependency signal is not doing the claimed causal work.

Figures

Figures reproduced from arXiv: 2604.01840 by Bing Qin, Dandan Tu, Haoyu Ren, Kun Chen, Qiming Li, Ruihan Chen, Xiaocheng Feng, Zekai Ye, Ziming Li.

**Figure 2.** Empirical analysis results of visual dependency. (a) The skewed distribution of token-level visual …

**Figure 3.** Overview of our proposed PGPO framework. The PGPO pipeline begins by quantifying …

**Figure 4.** Training dynamics on Qwen2.5-VL-7B. Training Stability. The effectiveness of PGPO is underpinned by superior training dynamics, as illustrated in the training curves against the baselines (Figure 4a), which demonstrates that PGPO …

**Figure 5.** Top 200 S tokens word cloud on all generated tokens of vision-dominant MathVerse. C.2 Visual Anchor Annotation To evaluate whether S is associated with semantic visual grounding, we constructed an annotation pipeline to identify "Visual Anchors"—tokens highly requiring image observation for inference. Annotation. We used complete trajectories from the vision-dominant MathVerse generation set. Given the c…

Original abstract

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be released at https://github.com/Yzk1114/PGPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PGPO uses KL divergence on token predictions to dynamically reshape RL advantages toward visually dependent tokens, delivering reported 18.7% gains, but the causal status of that KL measure remains the main open question.

The main takeaway is that this paper gives a practical token-level credit assignment trick for RL fine-tuning of vision-language models. They define token visual dependency as the KL divergence between the model's next-token distribution with and without the image, then apply a threshold to amplify advantages only for high-dependency tokens while keeping total advantage mass fixed. On Qwen2.5-VL models this produces an 18.7% average lift across seven multimodal reasoning benchmarks, plus lower gradient variance and fewer collapse cases during training. The approach is simple enough that the core loop could be reproduced quickly if the code drops as promised.

The experiments cover a reasonable range of tasks and the authors supply both empirical curves and some theoretical backing for the variance reduction, which addresses a known pain point in standard RLVR setups. Credit is due for making the sparsity of visual dependency explicit and turning it into an operational rule rather than leaving it as a vague intuition.

The soft spot is exactly the one flagged in the stress test. KL divergence will flag any statistical difference between the two distributions, so it can pick up dataset co-occurrences that are not causally visual. Without ablations that actually remove or edit the image content and check whether performance drops precisely on the high-KL tokens, it is hard to know whether the method is truly grounding reasoning or just reweighting linguistic priors. The threshold itself is a free parameter whose sensitivity is not fully explored in the abstract.

This work is aimed at groups already running RL post-training on LVLMs and looking for ways to reduce noise in long reasoning traces. It is coherent on its own terms and the empirical numbers are large enough to justify referee time, even if the causal interpretation needs tighter evidence. I would send it out for review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Perception-Grounded Policy Optimization (PGPO) for large vision-language models under reinforcement learning from verifiable rewards. It defines Token Visual Dependency as the KL divergence between visual-conditioned and text-only next-token predictive distributions, then applies a threshold-gated, mass-conserving reshaping of per-token advantages to amplify signals for visually dependent tokens while suppressing linguistic priors. Experiments on the Qwen2.5-VL series across seven multimodal reasoning benchmarks report an average 18.7% performance improvement, accompanied by theoretical and empirical claims of reduced gradient variance and avoidance of training collapse.

Significance. If the core mechanism is validated, PGPO would provide a concrete method for fine-grained credit assignment in multimodal RLVR, directly addressing the uniform-advantage dilution problem. The combination of an externally defined KL measure, theoretical variance analysis, and large reported gains on challenging benchmarks would be a useful contribution to perception-grounded reasoning in LVLMs. Planned code release is a positive factor for reproducibility.

major comments (3)

[§3.1–3.2] The central claim that KL divergence isolates causally visual tokens (rather than correlational co-occurrence statistics) is load-bearing for the advantage-reshaping step, yet the manuscript provides no controlled ablations such as visual ablation, counterfactual image edits, or gradient attribution to demonstrate that high-KL tokens are precisely those whose removal collapses multimodal performance.
[§4.1 and Table 2] The reported 18.7% average gain and gradient-variance reduction lack full specification of data splits, exact baseline RLVR implementations, and hyperparameter sensitivity for the visual-dependency threshold; without these, the empirical claims cannot be independently verified.
[§3.3] The mass-conserving reshaping is presented as bias-free, but the interaction between the chosen threshold and the resulting advantage distribution is not analyzed for potential new collapse modes or unintended reinforcement of linguistic priors.

minor comments (2)

[§3.2] Notation for the threshold-gated operator and the mass-conserving normalization should be made fully explicit in the main text rather than deferred to the appendix.
[Related Work] The manuscript would benefit from additional references to prior token-level credit assignment methods in vision-language RL.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which have helped us identify areas for improvement. We address each major comment below and commit to revisions that strengthen the manuscript's claims on causality, reproducibility, and analysis of the reshaping mechanism.

Referee: [§3.1–3.2] The central claim that KL divergence isolates causally visual tokens (rather than correlational co-occurrence statistics) is load-bearing for the advantage-reshaping step, yet the manuscript provides no controlled ablations such as visual ablation, counterfactual image edits, or gradient attribution to demonstrate that high-KL tokens are precisely those whose removal collapses multimodal performance.

Authors: We acknowledge that the manuscript would benefit from explicit causal validation beyond the definitional construction of KL divergence (which subtracts the text-only distribution to isolate visual information gain). In the revision, we will add controlled experiments: (i) performance degradation when high-KL tokens are masked versus low-KL or random tokens, (ii) results on counterfactual image edits for a subset of examples, and (iii) gradient attribution comparisons. These will appear in an expanded Section 3.2 and appendix, directly supporting the load-bearing claim without altering the core method.

revision: yes

Referee: [§4.1 and Table 2] The reported 18.7% average gain and gradient-variance reduction lack full specification of data splits, exact baseline RLVR implementations, and hyperparameter sensitivity for the visual-dependency threshold; without these, the empirical claims cannot be independently verified.

Authors: We agree that the current description is insufficient for independent verification. The revised manuscript will expand Section 4.1 with: exact train/validation/test splits for all seven benchmarks, precise baseline RLVR code-level implementation details (including optimizer, batch size, and learning rate schedules matching the referenced works), and a full hyperparameter sensitivity analysis for the threshold (including variance plots across values). The code release will contain all scripts and configs to enable exact reproduction.

revision: yes

Referee: [§3.3] The mass-conserving reshaping is presented as bias-free, but the interaction between the chosen threshold and the resulting advantage distribution is not analyzed for potential new collapse modes or unintended reinforcement of linguistic priors.

Authors: We thank the referee for highlighting this gap. While the mass-conserving property mathematically preserves the expected advantage (ensuring no net bias in the policy gradient), we will add a dedicated analysis in the revised Section 3.3. This includes: (i) theoretical discussion of threshold effects on the advantage distribution, (ii) empirical plots showing advantage histograms and training dynamics for multiple thresholds, and (iii) checks confirming absence of collapse or linguistic prior reinforcement. Current runs remain stable, but the added material will address potential edge cases.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation defines external measure then applies it empirically

The paper defines Token Visual Dependency explicitly as the KL divergence between visual-conditioned and text-only next-token distributions, then uses a threshold-gated reshaping of advantages in PGPO. This is a definitional step followed by an algorithmic application, not a reduction where the claimed performance gain or the dependency measure is forced by construction from its own inputs. No equations equate a prediction to a fitted parameter, no self-citation chain bears the central claim, and the 18.7% benchmark gains are presented as experimental outcomes rather than mathematical identities. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review limits visibility; the method rests on the assumption that KL divergence between conditional distributions isolates visual causal influence and that a fixed threshold plus mass conservation preserves valid policy gradients.

free parameters (1)

visual dependency threshold
Used to gate which tokens receive amplified advantage; value not specified in abstract.

axioms (1)

domain assumption: KL divergence between visual-conditioned and text-only next-token distributions quantifies causal visual information gain
Invoked to define Token Visual Dependency

invented entities (1)

Token Visual Dependency (no independent evidence)
purpose: Quantify per-token visual grounding for credit assignment
New quantity introduced to drive the advantage reshaping

pith-pipeline@v0.9.0 · 5560 in / 1295 out tokens · 33735 ms · 2026-05-13T21:18:27.824194+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions... threshold-gated, mass-conserving mechanism

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

PGPO... reduces gradient variance, prevents training collapse, and acts as a potent regularizer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Structured Role-Aware Policy Optimization for Multimodal Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper

[1]

Preprint, arXiv:2311.16922

Mitigating object hallucinations in large vision-language models through visual contrastive decoding. Preprint, arXiv:2311.16922. Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, and Bing Qin. 2025a. Unlocking multilingual reasoning capability of llms and lvlms through representation engineering. Preprint, arXiv:2511.23231...

[2]

Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024

Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973. Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, and 1 others. 2025. R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo. arXiv preprint ...

[3]

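GRPO baseline. The trajectory gradient in GRPO is

$g_{\mathrm{grpo}} = A \sum_{t=1}^{T} u_t.$ (16)

$\mathbb{E}\,\|g_{\mathrm{grpo}}\|_2^2 = \mathbb{E}\left[\left\| A \sum_{t=1}^{T} u_t \right\|_2^2\right] \overset{\text{Asm. 2}}{=} \mathbb{E}[A^2]\,\mathbb{E}\left[\left\| \sum_{t=1}^{T} u_t \right\|_2^2\right] = \mathbb{E}[A^2]\,\mathbb{E}\left[\sum_{t=1}^{T} \|u_t\|_2^2 + \sum_{t \neq j} u_t^\top u_j\right] \overset{\text{Asm. 1}}{\approx} \mathbb{E}[A^2] \sum_{t=1}^{T} \mathbb{E}[\|u_t\|_2^2] = \mathbb{E}[A^2] \left( \sum_{t \in V} \mathbb{E}[\|u_t\|_2^2] + \sum_{k \in B} \mathbb{E}[\|u_k\|_2^2] \right).$ (17)

The second summation in Eq. 17 is pure nuisance contribution: it is nonz...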

[4]

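PGPO estimator. PGPO applies token-wise modulation:

$g_{\mathrm{pgpo}} = A \sum_{t=1}^{T} \tilde{\omega}_t u_t.$ (18)

Then

$\mathbb{E}\,\|g_{\mathrm{pgpo}}\|_2^2 = \mathbb{E}\left[\left\| A \sum_{t=1}^{T} \tilde{\omega}_t u_t \right\|_2^2\right] \overset{\text{Asm. 2}}{=} \mathbb{E}[A^2]\,\mathbb{E}\left[\sum_{t=1}^{T} \|\tilde{\omega}_t u_t\|_2^2 + \sum_{t \neq j} \tilde{\omega}_t \tilde{\omega}_j u_t^\top u_j\right] \overset{\text{Asm. 1}}{\approx} \mathbb{E}[A^2] \sum_{t=1}^{T} \mathbb{E}[\tilde{\omega}_t^2 \|u_t\|_2^2] \overset{\text{Asm. 3}}{=} \mathbb{E}[A^2] \sum_{t=1}^{T} \mathbb{E}[\tilde{\omega}_t^2]\,\mathbb{E}[\|u_t\|_2^2] = \mathbb{E}[A^2] \left( \sum_{t \in V} \mathbb{E}[\tilde{\omega}_t^2]\,\mathbb{E}[\|u_t\|_2^2] + \sum_{k \in B} \mathbb{E}[\tilde{\omega}_k^2]\,\mathbb{E}[\|u_k\|_2^2] \right).$ (19)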

[5]

Noise bound and interpretation. By construction, PGPO keeps $\tilde{\omega}_t$ around unit scale on $t \in V$ and enforces $\tilde{\omega}_k \le \varepsilon$ on $k \in B$. Plugging this into Eq. 19 gives

$\mathbb{E}\,\|g_{\mathrm{pgpo}}\|_2^2 \le \mathbb{E}[A^2] \left( \sum_{t \in V} \mathbb{E}[\|u_t\|_2^2] + \varepsilon^2 \sum_{k \in B} \mathbb{E}[\|u_k\|_2^2] \right).$ (20)

Combined with Assumption 4, the bound shows a clean separation: the useful visually grounded component is maintained, while the nuisance...
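To make the size of the bound concrete, the following sketch evaluates the diagonal second-moment expressions of Eqs. 17, 19 and 20 on a toy sequence. Everything numeric here is hypothetical: ten tokens, two of them visually grounded, unit per-token energy E[‖u_t‖²], E[A²] = 1, and a background cap ε = 0.1. The helper name `second_moment_proxy` is also an assumption of this example, and the weights are not renormalized, matching the "unit scale on V, at most ε on B" setting of the bound.

```python
def second_moment_proxy(adv_sq, token_energy, weights):
    """E[A^2] * sum_t w_t^2 * E||u_t||^2, i.e. the cross-term-free
    approximation used in the derivation above."""
    return adv_sq * sum(w * w * e for w, e in zip(weights, token_energy))

# Hypothetical toy sequence: 2 visual tokens (set V), 8 background tokens (set B).
is_visual = [True, True] + [False] * 8
token_energy = [1.0] * 10   # assumed E||u_t||^2 per token
adv_sq = 1.0                # assumed E[A^2]
eps_cap = 0.1               # assumed cap on background weights

w_grpo = [1.0] * 10                                   # uniform advantage
w_pgpo = [1.0 if v else eps_cap for v in is_visual]   # background suppressed

grpo = second_moment_proxy(adv_sq, token_energy, w_grpo)
pgpo = second_moment_proxy(adv_sq, token_energy, w_pgpo)

# Right-hand side of the bound: visual energy plus eps^2 times background energy.
visual_energy = sum(e for e, v in zip(token_energy, is_visual) if v)
background_energy = sum(e for e, v in zip(token_energy, is_visual) if not v)
bound = adv_sq * (visual_energy + eps_cap ** 2 * background_energy)

print(grpo, pgpo, bound)
```

Under these assumed numbers the uniform estimator carries a second moment of 10, of which 8 comes from background tokens, while the reweighted estimator carries about 2.08 and meets the bound with equality.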

[6]

Strict inflation in Löwner order. If $F \succ 0$ and $|\mu| > \frac{2\|C\|_2}{\lambda_{\min}(F)}$, (28) then $\mathrm{Cov}(\hat{g}_\mu) \succ \mathrm{Cov}(\hat{g}_0)$. (29)

[7]

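Asymptotic quadratic dominance. As $|\mu| \to \infty$, $\mathrm{Cov}(\hat{g}_\mu) - \mathrm{Cov}(\hat{g}_0) = \mu^2 F + O(|\mu|)$, (30) so the covariance penalty is asymptotically dominated by the quadratic term $\mu^2 F$. Proof. Since $\mathbb{E}[\hat{g}_\mu] = \mathbb{E}[\hat{g}_0]$, the covariance difference equals the difference of the second-moment matrices: $\Delta_{\mathrm{Cov}} := \mathrm{Cov}(\hat{g}_\mu) - \mathrm{Cov}(\hat{g}_0) = \mathbb{E}[\hat{g}_\mu \hat{g}_\mu^\top] - \mathbb{E}[\hat{g}_0 \hat{g}_0^\top]$. (31) Expanding $\hat{g}_\mu = (A^{*} + \mu)$...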

[8]

When $\tilde{S}_t \le m$ (token $t$ is the minimum): $\min_j \tilde{S}_j = \tilde{S}_t$. The numerator becomes $\tilde{S}_t - \tilde{S}_t = 0$, yielding $I_t = 0$. The derivative is $0$.

[9]

When $m < \tilde{S}_t < M$ (token $t$ is an intermediate value): Here, $\min_j \tilde{S}_j = m$ and $\max_j \tilde{S}_j = M$. The derivative is: $\frac{\partial I_t}{\partial \tilde{S}_t} = \frac{1}{M - m + \epsilon} > 0$ (46)

[10]

When $\tilde{S}_t \ge M$ (token $t$ is the maximum): Here, $\max_j \tilde{S}_j = \tilde{S}_t$ and $\min_j \tilde{S}_j = m$. Applying the quotient rule: $\frac{\partial I_t}{\partial \tilde{S}_t} = \frac{\partial}{\partial \tilde{S}_t} \left( \frac{\tilde{S}_t - m}{\tilde{S}_t - m + \epsilon} \right) = \frac{\epsilon}{(\tilde{S}_t - m + \epsilon)^2} > 0$ (47) Across all intervals, $\frac{\partial I_t}{\partial \tilde{S}_t} \ge 0$. Hence, $I_t$ is monotonically non-decreasing with respect to $\tilde{S}_t$. Step 3: Threshold-Gating Function. The piecewise gating function computes the base w...

[11]

When $I_t < \tau$: $\omega_t = \frac{I_t}{\tau + \epsilon} \implies \frac{\partial \omega_t}{\partial I_t} = \frac{1}{\tau + \epsilon} > 0$.

[12]

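When $I_t \ge \tau$: $\omega_t = 1 + \beta\,\frac{I_t - \tau}{1 - \tau + \epsilon} \implies \frac{\partial \omega_t}{\partial I_t} = \frac{\beta}{1 - \tau + \epsilon} \ge 0$ (strictly positive for $\beta > 0$).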

[13]

Boundary at $I_t = \tau$: The right-hand evaluation is $\omega(\tau) = 1$. The left-hand limit is $\lim_{I_t \to \tau^-} \omega(I_t) = \frac{\tau}{\tau + \epsilon}$. Since $\epsilon > 0$, the left limit is strictly less than 1. This positive jump discontinuity guarantees that the function strictly preserves the non-decreasing property as it crosses the threshold. Consequently, $\omega_t$ is monotonically non-decreasing with...

[14]

Logarithmic Compression: Since $f(x) = \log(1 + x)$ is strictly monotonically increasing for $x \ge 0$, we have: $S_A > S_B \implies \tilde{S}_A > \tilde{S}_B$ (50)

[15]

Min-Max Normalization: The normalization applies a linear transformation: $I_A = \frac{\tilde{S}_A - m}{M - m + \epsilon}, \quad I_B = \frac{\tilde{S}_B - m}{M - m + \epsilon}$ (51) Because the denominator $(M - m + \epsilon) > 0$ is identical for both tokens, the order is strictly preserved: $I_A > I_B$

[16]

Threshold-Gating Function: For $\beta > 0$, the piecewise function $\omega(I)$ consists of two linear segments with strictly positive slopes, connected by a positive upward jump at $I = \tau$. Because the function is globally strictly monotonically increasing across its domain $[0, 1]$, we obtain: $I_A > I_B \implies \omega_A > \omega_B$ (52)

[17]

Sum-Preserving Renormalization: The final step applies sequence-level scaling: $\tilde{\omega}_A = \omega_A \cdot \frac{T}{S_{\mathrm{total}}}, \quad \tilde{\omega}_B = \omega_B \cdot \frac{T}{S_{\mathrm{total}}}$ (53) Since the scalar multiplier $\frac{T}{S_{\mathrm{total}}} > 0$ is a shared constant, the inequality is maintained: $\tilde{\omega}_A > \tilde{\omega}_B$. Conclusion. Through all sequential operations, the relationship $S_A > S_B \implies \tilde{\omega}_A > \tilde{\omega}_B$ holds strictly true. Therefore, the proposed mod...
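The four steps quoted in these fragments (logarithmic compression, min-max normalization, threshold gating, sum-preserving renormalization) can be chained into one small function. The sketch below is an illustration under stated assumptions, not the released code: the name `pgpo_weights`, the default values of `tau`, `beta` and `eps`, the reading of S_total as the sum of the gated weights, and the uniform fallback when every gated weight is zero are all choices made for this example.

```python
import math

def pgpo_weights(S, tau=0.5, beta=1.0, eps=1e-6):
    """Map raw dependency scores S_t >= 0 to per-token weights that sum to T.

    tau, beta and eps are hypothetical defaults chosen for illustration.
    """
    T = len(S)
    # Step 1: logarithmic compression, S~_t = log(1 + S_t).
    S_tilde = [math.log1p(s) for s in S]
    # Step 2: min-max normalization, I_t = (S~_t - m) / (M - m + eps).
    m, M = min(S_tilde), max(S_tilde)
    I = [(s - m) / (M - m + eps) for s in S_tilde]
    # Step 3: threshold gating with an upward jump at I = tau.
    omega = [
        i / (tau + eps) if i < tau else 1.0 + beta * (i - tau) / (1.0 - tau + eps)
        for i in I
    ]
    # Step 4: sum-preserving renormalization, assuming S_total = sum of gated weights.
    total = sum(omega)
    if total == 0.0:
        # Assumed fallback: identical scores carry no signal, so keep uniform weights.
        return [1.0] * T
    return [w * T / total for w in omega]

# Hypothetical scores: three near-zero tokens and two visually dependent ones.
scores = [0.0, 0.01, 0.02, 1.5, 3.0]
weights = pgpo_weights(scores)

# Token-level advantages for a sequence-level advantage A (value assumed).
A = 0.8
token_advantages = [A * w for w in weights]
print(weights)
print(sum(token_advantages) / len(token_advantages))
```

Because the weights sum to the sequence length, the mean token advantage stays equal to A, and the ordering of the weights follows the ordering of the raw scores, which is the order-preservation property the fragments above argue for.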
