Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
Pith reviewed 2026-05-13 21:18 UTC · model grok-4.3
The pith
Perception-Grounded Policy Optimization reshapes token advantages by their visual dependency, lifting multimodal reasoning performance by a reported 18.7 percent on average across seven benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Perception-Grounded Policy Optimization (PGPO) is a token-level credit assignment method that quantifies each token’s causal dependence on visual input via KL divergence between visual-conditioned and text-only predictive distributions and then applies a threshold-gated, mass-conserving reshape to the advantage vector, thereby concentrating policy gradient updates on perception-grounded reasoning steps.
What carries the argument
Token Visual Dependency, computed as the Kullback-Leibler divergence between the model’s visual-conditioned and text-only next-token distributions, which acts as a sparse per-token mask to dynamically reshape policy advantages.
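A minimal sketch of how such a per-token score could be computed, assuming access to next-token logits from two forward passes over the same response (one with the image, one without). The function name `token_visual_dependency`, the NumPy setup, and the KL direction (visual-conditioned relative to text-only) are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL(p_visual || p_text); inputs have shape (T, vocab).

    The KL direction is an assumption made for this sketch.
    """
    log_p = log_softmax(logits_with_image)
    log_q = log_softmax(logits_text_only)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Toy example: 4 tokens, vocabulary of 6. Only token 2 changes with the image.
rng = np.random.default_rng(0)
text_logits = rng.normal(size=(4, 6))
visual_logits = text_logits.copy()
visual_logits[2, 0] += 5.0
scores = token_visual_dependency(visual_logits, text_logits)
```

In this toy case the score is zero wherever the two distributions coincide and positive only for the token whose distribution moves when the image is present, which is the sparsity pattern the paper describes.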
If this is right
- Gradient variance drops because noise from text-only tokens is suppressed.
- Training stability improves and collapse is avoided without extra regularization terms.
- Multimodal reasoning accuracy rises across diverse benchmarks while using the identical base model and reward.
- The method functions as an implicit regularizer that favors solutions grounded in perception over pure linguistic shortcuts.
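The variance claim in the first bullet can be illustrated with a small calculation. The sketch below assumes the simplified second-moment form used in the paper's appendix, E[A^2] times the sum over tokens of w_t^2 E||u_t||^2, with deterministic weights. The per-token gradient norms, the visual mask, and the cap `eps` are hypothetical values chosen for illustration, and the sum-preserving renormalization step is left out.

```python
import numpy as np

def second_moment(weights, adv_sq_mean, token_sq_norms):
    # E[A^2] * sum_t w_t^2 * E||u_t||^2, ignoring cross terms between tokens.
    return adv_sq_mean * np.sum(weights ** 2 * token_sq_norms)

# Hypothetical setup: 6 tokens, 2 of them visually grounded.
token_sq_norms = np.ones(6)          # assumed E||u_t||^2 for each token
is_visual = np.array([True, False, False, True, False, False])
eps = 0.1                            # assumed cap on weights of text-only tokens
adv_sq_mean = 1.0                    # assumed E[A^2]

uniform_weights = np.ones(6)                       # GRPO-style: same weight everywhere
gated_weights = np.where(is_visual, 1.0, eps)      # unit scale on visual tokens, eps elsewhere

uniform_moment = second_moment(uniform_weights, adv_sq_mean, token_sq_norms)
gated_moment = second_moment(gated_weights, adv_sq_mean, token_sq_norms)
```

With these numbers the text-only tokens contribute 4.0 to the uniform estimate but only about 0.04 to the gated one, while the contribution of the visual tokens is unchanged.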
Where Pith is reading between the lines
- The same dependency measure could be applied to audio or other sensory conditioning by swapping the visual input for the appropriate modality.
- Threshold selection may be task-dependent; a learned gating network could replace the fixed threshold as a direct extension.
- PGPO could be combined with external-tool or chain-of-thought verification to further isolate tokens that require non-linguistic information.
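To make the fixed threshold in the second bullet concrete, the piecewise gate given in the paper's appendix can be sketched as follows. The default values for `tau`, `beta`, and `eps` are placeholders chosen for illustration, not values reported by the authors.

```python
def gate(importance, tau=0.5, beta=1.0, eps=1e-6):
    """Threshold gate on a normalized importance score in [0, 1].

    Below tau the weight scales linearly toward (just under) 1;
    at or above tau it starts at 1 and grows with slope beta / (1 - tau + eps).
    """
    if importance < tau:
        return importance / (tau + eps)
    return 1.0 + beta * (importance - tau) / (1.0 - tau + eps)

# Evaluate the gate on a grid to inspect its shape around the threshold.
grid = [i / 100 for i in range(101)]
weights = [gate(x) for x in grid]
just_below = gate(0.5 - 1e-9)
at_threshold = gate(0.5)
```

A learned gating network, as suggested above, would replace this fixed `tau` with a trained function; the sketch only shows the fixed-threshold baseline.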
Load-bearing premise
The KL divergence between visual and text-only distributions reliably isolates tokens whose correctness depends on seeing the image rather than on language statistics alone.
What would settle it
Run the same training loop on a purely linguistic reasoning dataset with no images; if PGPO still improves or maintains performance, the visual-dependency signal is not doing the claimed causal work.
Original abstract
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be released on https://github.com/Yzk1114/PGPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Perception-Grounded Policy Optimization (PGPO) for large vision-language models under reinforcement learning from verifiable rewards. It defines Token Visual Dependency as the KL divergence between visual-conditioned and text-only next-token predictive distributions, then applies a threshold-gated, mass-conserving reshaping of per-token advantages to amplify signals for visually dependent tokens while suppressing linguistic priors. Experiments on the Qwen2.5-VL series across seven multimodal reasoning benchmarks report an average 18.7% performance improvement, accompanied by theoretical and empirical claims of reduced gradient variance and avoidance of training collapse.
Significance. If the core mechanism is validated, PGPO would provide a concrete method for fine-grained credit assignment in multimodal RLVR, directly addressing the uniform-advantage dilution problem. The combination of an externally defined KL measure, theoretical variance analysis, and large reported gains on challenging benchmarks would be a useful contribution to perception-grounded reasoning in LVLMs. Planned code release is a positive factor for reproducibility.
major comments (3)
- [§3.1–3.2] The central claim that KL divergence isolates causally visual tokens (rather than correlational co-occurrence statistics) is load-bearing for the advantage-reshaping step, yet the manuscript provides no controlled ablations such as visual ablation, counterfactual image edits, or gradient attribution to demonstrate that high-KL tokens are precisely those whose removal collapses multimodal performance.
- [§4.1 and Table 2] The reported 18.7% average gain and gradient-variance reduction lack full specification of data splits, exact baseline RLVR implementations, and hyperparameter sensitivity for the visual-dependency threshold; without these, the empirical claims cannot be independently verified.
- [§3.3] The mass-conserving reshaping is presented as bias-free, but the interaction between the chosen threshold and the resulting advantage distribution is not analyzed for potential new collapse modes or unintended reinforcement of linguistic priors.
minor comments (2)
- [§3.2] Notation for the threshold-gated operator and the mass-conserving normalization should be made fully explicit in the main text rather than deferred to the appendix.
- [Related Work] The manuscript would benefit from additional references to prior token-level credit assignment methods in vision-language RL.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which have helped us identify areas for improvement. We address each major comment below and commit to revisions that strengthen the manuscript's claims on causality, reproducibility, and analysis of the reshaping mechanism.
Point-by-point responses
-
Referee: [§3.1–3.2] The central claim that KL divergence isolates causally visual tokens (rather than correlational co-occurrence statistics) is load-bearing for the advantage-reshaping step, yet the manuscript provides no controlled ablations such as visual ablation, counterfactual image edits, or gradient attribution to demonstrate that high-KL tokens are precisely those whose removal collapses multimodal performance.
Authors: We acknowledge that the manuscript would benefit from explicit causal validation beyond the definitional construction of KL divergence (which subtracts the text-only distribution to isolate visual information gain). In the revision, we will add controlled experiments: (i) performance degradation when high-KL tokens are masked versus low-KL or random tokens, (ii) results on counterfactual image edits for a subset of examples, and (iii) gradient attribution comparisons. These will appear in an expanded Section 3.2 and appendix, directly supporting the load-bearing claim without altering the core method. revision: yes
-
Referee: [§4.1 and Table 2] The reported 18.7% average gain and gradient-variance reduction lack full specification of data splits, exact baseline RLVR implementations, and hyperparameter sensitivity for the visual-dependency threshold; without these, the empirical claims cannot be independently verified.
Authors: We agree that the current description is insufficient for independent verification. The revised manuscript will expand Section 4.1 with: exact train/validation/test splits for all seven benchmarks, precise baseline RLVR code-level implementation details (including optimizer, batch size, and learning rate schedules matching the referenced works), and a full hyperparameter sensitivity analysis for the threshold (including variance plots across values). The code release will contain all scripts and configs to enable exact reproduction. revision: yes
-
Referee: [§3.3] The mass-conserving reshaping is presented as bias-free, but the interaction between the chosen threshold and the resulting advantage distribution is not analyzed for potential new collapse modes or unintended reinforcement of linguistic priors.
Authors: We thank the referee for highlighting this gap. While the mass-conserving property mathematically preserves the expected advantage (ensuring no net bias in the policy gradient), we will add a dedicated analysis in the revised Section 3.3. This includes: (i) theoretical discussion of threshold effects on the advantage distribution, (ii) empirical plots showing advantage histograms and training dynamics for multiple thresholds, and (iii) checks confirming absence of collapse or linguistic prior reinforcement. Current runs remain stable, but the added material will address potential edge cases. revision: yes
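The mass-conserving property discussed in this exchange can be written out in one line. This is a hedged sketch, assuming the renormalized weights take the form $\tilde{\omega}_t = \omega_t \cdot T / S_{\mathrm{total}}$ with $S_{\mathrm{total}} = \sum_j \omega_j$ (the definition of $S_{\mathrm{total}}$ is an assumption, since the excerpt available here does not spell it out) and that the baseline assigns the same scalar advantage $A$ to every one of the $T$ tokens:

```latex
\sum_{t=1}^{T} \tilde{\omega}_t
  = \frac{T}{S_{\mathrm{total}}} \sum_{t=1}^{T} \omega_t
  = T
\quad\Longrightarrow\quad
\sum_{t=1}^{T} \tilde{\omega}_t A = T A = \sum_{t=1}^{T} A .
```

Under these assumptions the total advantage assigned to a trajectory is unchanged and only its distribution over tokens shifts. The identity does not by itself rule out the threshold-dependent effects the referee raises.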
Circularity Check
No significant circularity; derivation defines external measure then applies it empirically
Full rationale
The paper defines Token Visual Dependency explicitly as the KL divergence between visual-conditioned and text-only next-token distributions, then uses a threshold-gated reshaping of advantages in PGPO. This is a definitional step followed by an algorithmic application, not a reduction where the claimed performance gain or the dependency measure is forced by construction from its own inputs. No equations equate a prediction to a fitted parameter, no self-citation chain bears the central claim, and the 18.7% benchmark gains are presented as experimental outcomes rather than mathematical identities. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- visual dependency threshold
axioms (1)
- domain assumption: KL divergence between visual-conditioned and text-only next-token distributions quantifies causal visual information gain
invented entities (1)
-
Token Visual Dependency
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions... threshold-gated, mass-conserving mechanism
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
PGPO... reduces gradient variance, prevents training collapse, and acts as a potent regularizer
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Structured Role-Aware Policy Optimization for Multimodal Reasoning
SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...
Reference graph
Works this paper leans on
-
[1]
Mitigating object hallucinations in large vision-language models through visual contrastive decoding. Preprint, arXiv:2311.16922. Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, and Bing Qin. 2025a. Unlocking multilingual reasoning capability of LLMs and LVLMs through representation engineering. Preprint, arXiv:2511.23231...
-
[2]
LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973. Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, and 1 others. 2025. R1-ShareVL: Incentivizing reasoning capability of multimodal large language models via Share-GRPO. arXiv preprint ...
-
[3]
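GRPO baseline. The trajectory gradient in GRPO is
$$g_{\mathrm{grpo}} = A \sum_{t=1}^{T} u_t. \tag{16}$$
Then
$$\begin{aligned}
\mathbb{E}\,\|g_{\mathrm{grpo}}\|_2^2
&= \mathbb{E}\,\Big\| A \sum_{t=1}^{T} u_t \Big\|_2^2 \\
&\overset{\text{Asm. 2}}{=} \mathbb{E}[A^2]\,\mathbb{E}\,\Big\| \sum_{t=1}^{T} u_t \Big\|_2^2 \\
&= \mathbb{E}[A^2]\,\mathbb{E}\Big[ \sum_{t=1}^{T} \|u_t\|_2^2 + \sum_{t \neq j} u_t^\top u_j \Big] \\
&\overset{\text{Asm. 1}}{\approx} \mathbb{E}[A^2] \sum_{t=1}^{T} \mathbb{E}\big[\|u_t\|_2^2\big] \\
&= \mathbb{E}[A^2] \Big( \sum_{t \in \mathcal{V}} \mathbb{E}\big[\|u_t\|_2^2\big] + \sum_{k \in \mathcal{B}} \mathbb{E}\big[\|u_k\|_2^2\big] \Big).
\end{aligned} \tag{17}$$
The second summation in Eq. 17 is pure nuisance contribution: it is nonz...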
-
[4]
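PGPO estimator. PGPO applies token-wise modulation:
$$g_{\mathrm{pgpo}} = A \sum_{t=1}^{T} \tilde{\omega}_t u_t. \tag{18}$$
Then
$$\begin{aligned}
\mathbb{E}\,\|g_{\mathrm{pgpo}}\|_2^2
&= \mathbb{E}\,\Big\| A \sum_{t=1}^{T} \tilde{\omega}_t u_t \Big\|_2^2 \\
&\overset{\text{Asm. 2}}{=} \mathbb{E}[A^2]\,\mathbb{E}\Big[ \sum_{t=1}^{T} \|\tilde{\omega}_t u_t\|_2^2 + \sum_{t \neq j} \tilde{\omega}_t \tilde{\omega}_j\, u_t^\top u_j \Big] \\
&\overset{\text{Asm. 1}}{\approx} \mathbb{E}[A^2] \sum_{t=1}^{T} \mathbb{E}\big[\tilde{\omega}_t^2 \|u_t\|_2^2\big] \\
&\overset{\text{Asm. 3}}{=} \mathbb{E}[A^2] \sum_{t=1}^{T} \mathbb{E}\big[\tilde{\omega}_t^2\big]\,\mathbb{E}\big[\|u_t\|_2^2\big] \\
&= \mathbb{E}[A^2] \Big( \sum_{t \in \mathcal{V}} \mathbb{E}\big[\tilde{\omega}_t^2\big]\,\mathbb{E}\big[\|u_t\|_2^2\big] + \sum_{k \in \mathcal{B}} \mathbb{E}\big[\tilde{\omega}_k^2\big]\,\mathbb{E}\big[\|u_k\|_2^2\big] \Big).
\end{aligned} \tag{19}$$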
-
[5]
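Noise bound and interpretation. By construction, PGPO keeps $\tilde{\omega}_t$ around unit scale on $t \in \mathcal{V}$ and enforces $\tilde{\omega}_k \le \varepsilon$ on $k \in \mathcal{B}$. Plugging this into Eq. 19 gives
$$\mathbb{E}\,\|g_{\mathrm{pgpo}}\|_2^2 \le \mathbb{E}[A^2] \Big( \sum_{t \in \mathcal{V}} \mathbb{E}\big[\|u_t\|_2^2\big] + \varepsilon^2 \sum_{k \in \mathcal{B}} \mathbb{E}\big[\|u_k\|_2^2\big] \Big). \tag{20}$$
Combined with Assumption 4, the bound shows a clean separation: the useful visually grounded component is maintained, while the nuisance...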
-
[6]
Strict inflation in Löwner order. If $F \succ 0$ and
$$|\mu| > \frac{2\|C\|_2}{\lambda_{\min}(F)}, \tag{28}$$
then
$$\mathrm{Cov}(\hat{g}_\mu) \succ \mathrm{Cov}(\hat{g}_0). \tag{29}$$
-
[7]
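Asymptotic quadratic dominance. As $|\mu| \to \infty$,
$$\mathrm{Cov}(\hat{g}_\mu) - \mathrm{Cov}(\hat{g}_0) = \mu^2 F + O(|\mu|), \tag{30}$$
so the covariance penalty is asymptotically dominated by the quadratic term $\mu^2 F$.
Proof. Since $\mathbb{E}[\hat{g}_\mu] = \mathbb{E}[\hat{g}_0]$, the covariance difference equals the difference of the second-moment matrices:
$$\Delta\mathrm{Cov} := \mathrm{Cov}(\hat{g}_\mu) - \mathrm{Cov}(\hat{g}_0) = \mathbb{E}\big[\hat{g}_\mu \hat{g}_\mu^\top\big] - \mathbb{E}\big[\hat{g}_0 \hat{g}_0^\top\big]. \tag{31}$$
Expanding $\hat{g}_\mu = (A^* + \mu)$...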
-
[8]
When $\tilde{S}_t \le m$ (token $t$ is the minimum): $\min_j \tilde{S}_j = \tilde{S}_t$. The numerator becomes $\tilde{S}_t - \tilde{S}_t = 0$, yielding $I_t = 0$. The derivative is $0$.
-
[9]
When $m < \tilde{S}_t < M$ (token $t$ is an intermediate value): Here, $\min_j \tilde{S}_j = m$ and $\max_j \tilde{S}_j = M$. The derivative is:
$$\frac{\partial I_t}{\partial \tilde{S}_t} = \frac{1}{M - m + \epsilon} > 0 \tag{46}$$
-
[10]
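When $\tilde{S}_t \ge M$ (token $t$ is the maximum): Here, $\max_j \tilde{S}_j = \tilde{S}_t$ and $\min_j \tilde{S}_j = m$. Applying the quotient rule:
$$\frac{\partial I_t}{\partial \tilde{S}_t} = \frac{\partial}{\partial \tilde{S}_t} \left( \frac{\tilde{S}_t - m}{\tilde{S}_t - m + \epsilon} \right) = \frac{\epsilon}{(\tilde{S}_t - m + \epsilon)^2} > 0 \tag{47}$$
Across all intervals, $\frac{\partial I_t}{\partial \tilde{S}_t} \ge 0$. Hence, $I_t$ is monotonically non-decreasing with respect to $\tilde{S}_t$.
Step 3: Threshold-Gating Function. The piecewise gating function computes the base w...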
-
[11]
When $I_t < \tau$: $\omega_t = \dfrac{I_t}{\tau + \epsilon} \implies \dfrac{\partial \omega_t}{\partial I_t} = \dfrac{1}{\tau + \epsilon} > 0$.
-
[12]
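When $I_t \ge \tau$: $\omega_t = 1 + \beta\,\dfrac{I_t - \tau}{1 - \tau + \epsilon} \implies \dfrac{\partial \omega_t}{\partial I_t} = \dfrac{\beta}{1 - \tau + \epsilon} \ge 0$ (strictly positive for $\beta > 0$).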
-
[13]
Boundary at $I_t = \tau$: The right-hand evaluation is $\omega(\tau) = 1$. The left-hand limit is $\lim_{I_t \to \tau^-} \omega(I_t) = \frac{\tau}{\tau + \epsilon}$. Since $\epsilon > 0$, the left limit is strictly less than 1. This positive jump discontinuity guarantees that the function strictly preserves the non-decreasing property as it crosses the threshold. Consequently, $\omega_t$ is monotonically non-decreasing with...
-
[14]
Logarithmic Compression: Since $f(x) = \log(1 + x)$ is strictly monotonically increasing for $x \ge 0$, we have:
$$S_A > S_B \implies \tilde{S}_A > \tilde{S}_B \tag{50}$$
-
[15]
Min-Max Normalization: The normalization applies a linear transformation:
$$I_A = \frac{\tilde{S}_A - m}{M - m + \epsilon}, \qquad I_B = \frac{\tilde{S}_B - m}{M - m + \epsilon} \tag{51}$$
Because the denominator $(M - m + \epsilon) > 0$ is identical for both tokens, the order is strictly preserved: $I_A > I_B$.
-
[16]
Threshold-Gating Function: For $\beta > 0$, the piecewise function $\omega(I)$ consists of two linear segments with strictly positive slopes, connected by a positive upward jump at $I = \tau$. Because the function is globally strictly monotonically increasing across its domain $[0, 1]$, we obtain:
$$I_A > I_B \implies \omega_A > \omega_B \tag{52}$$
-
[17]
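Sum-Preserving Renormalization: The final step applies sequence-level scaling:
$$\tilde{\omega}_A = \omega_A \cdot \frac{T}{S_{\mathrm{total}}}, \qquad \tilde{\omega}_B = \omega_B \cdot \frac{T}{S_{\mathrm{total}}} \tag{53}$$
Since the scalar multiplier $\frac{T}{S_{\mathrm{total}}} > 0$ is a shared constant, the inequality is maintained: $\tilde{\omega}_A > \tilde{\omega}_B$.
Conclusion. Through all sequential operations, the relationship $S_A > S_B \implies \tilde{\omega}_A > \tilde{\omega}_B$ holds strictly true. Therefore, the proposed mod...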
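The four steps walked through in the excerpts above (logarithmic compression, min-max normalization, threshold gating, sum-preserving renormalization) can be chained into one function. This is a sketch under stated assumptions, not the authors' implementation: the name `reshape_weights`, the defaults for `tau`, `beta`, and `eps`, the reading of the total as the plain sum of gated weights, and the fallback to uniform weights when every score is identical are all choices made for illustration.

```python
import numpy as np

def reshape_weights(scores, tau=0.5, beta=1.0, eps=1e-6):
    """Map raw per-token dependency scores (>= 0) to weights that sum to T."""
    s = np.log1p(np.asarray(scores, dtype=float))        # step 1: log(1 + x) compression
    m, M = s.min(), s.max()
    importance = (s - m) / (M - m + eps)                 # step 2: min-max normalization
    below = importance / (tau + eps)                     # step 3: gate, branch under tau
    above = 1.0 + beta * (importance - tau) / (1.0 - tau + eps)
    gated = np.where(importance < tau, below, above)
    total = gated.sum()
    if total == 0.0:                                     # assumed fallback: all scores equal
        return np.ones_like(gated)
    return gated * len(gated) / total                    # step 4: rescale so weights sum to T

# Toy sequence of 5 tokens with one shared trajectory-level advantage.
scores = np.array([0.0, 0.02, 1.5, 0.01, 3.0])
advantage = 2.0
weights = reshape_weights(scores)
token_advantages = advantage * weights
```

Because every step is monotone and the last one is a shared positive rescaling, the ordering of `weights` matches the ordering of `scores`, and the per-token advantages still add up to the sequence length times the original advantage.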