Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

Boyang Liu; Jiazheng Zhang; Peixin Wang; Qi Zhang; Senjie Jin; Shuo Li; Tao Gui; Xiaoran Fan; Xuanjing Huang; Yuhao Zhou; Zhiheng Xi

arxiv: 2606.03937 · v2 · pith:CJ5PAGNYnew · submitted 2026-06-02 · 💻 cs.AI

Senjie Jin, Peixin Wang, Boyang Liu, Xiaoran Fan, Shuo Li, Zhiheng Xi, Jiazheng Zhang, Yuhao Zhou, Tao Gui, Qi Zhang, Xuanjing Huang

Pith reviewed 2026-06-28 09:56 UTC · model grok-4.3

keywords reinforcement learning · visual reasoning · token selection · policy optimization · entropy · multimodal · credit assignment · vision-anchored

The pith

Combining visual sensitivity with token entropy improves credit assignment in reinforcement learning for visual reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that entropy-based credit assignment works for text-only RL but collapses in visual reasoning because it skips vision-sensitive tokens that happen to carry low entropy. VEPO fixes this by multiplicatively combining a visual sensitivity score with entropy so that gradients favor tokens that are both perceptually grounded and semantically informative. The resulting policy optimization produces higher scores on visual reasoning tasks than entropy-only baselines. A sympathetic reader would care because the result points to a concrete way to make RL credit assignment respect both perception and reasoning in multimodal settings.
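To make the failure mode concrete, here is a minimal Python sketch of entropy-only token selection. The function names (`token_entropy`, `top_entropy_mask`) and the toy distributions are illustrative assumptions, not taken from the paper. The point is only that a confidently predicted token receives a low entropy score and is excluded, whether or not it depends on the image.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_mask(entropies, k):
    """Return a 0/1 mask keeping the top-k fraction of tokens by entropy."""
    n_keep = max(1, round(k * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(entropies))]

# Toy response of five tokens; each row is a hypothetical next-token distribution.
dists = [
    [0.25, 0.25, 0.25, 0.25],  # uncertain token (maximum entropy over 4 options)
    [0.97, 0.01, 0.01, 0.01],  # confident token, e.g. a value read off the image
    [0.40, 0.30, 0.20, 0.10],
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
]
entropies = [token_entropy(d) for d in dists]
mask = top_entropy_mask(entropies, k=0.4)  # keep 2 of 5 tokens
print(mask)  # [1, 0, 1, 0, 0]
```

In this toy case the confident token at index 1 is dropped from the update, which is the kind of omission the paper attributes to entropy-only selection.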

Core claim

VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative by integrating visual sensitivity with token entropy via a principled multiplicative coupling, leading to superior performance over entropy-only baselines in visual reasoning.

What carries the argument

VEPO (Vision-Entropy token-selection for Policy Optimization), which multiplicatively couples a visual sensitivity measure with token entropy to reweight credit assignment.
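The text above says only that the coupling is multiplicative; the Figure 4 caption further mentions a balancing coefficient α and a token selection ratio k. As a hedged illustration, the sketch below assumes a geometric weighting `s**alpha * H**(1 - alpha)`. That specific form, the function names, and the toy numbers are assumptions made for this example and may differ from the paper's actual formula.

```python
def coupled_scores(sensitivity, entropy, alpha=0.5):
    """Hypothetical multiplicative coupling: s_t**alpha * H_t**(1 - alpha).

    ASSUMPTION: the exact form used by VEPO is not given in the text above;
    this geometric weighting is chosen only to illustrate the idea.
    """
    return [(s ** alpha) * (h ** (1.0 - alpha)) for s, h in zip(sensitivity, entropy)]

def top_k_mask(scores, k):
    """Return a 0/1 mask keeping the top-k fraction of tokens by score."""
    n_keep = max(1, round(k * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]

# Toy per-token values (made up for illustration).
entropy     = [1.39, 0.17, 1.28, 0.43, 0.94]
sensitivity = [0.01, 0.60, 0.30, 0.50, 0.02]  # e.g. a divergence-based visual score

entropy_only = top_k_mask(coupled_scores(sensitivity, entropy, alpha=0.0), k=0.4)
coupled      = top_k_mask(coupled_scores(sensitivity, entropy, alpha=0.5), k=0.4)
print(entropy_only)  # [1, 0, 1, 0, 0]
print(coupled)       # [0, 0, 1, 1, 0]
```

Under this assumed form, `alpha=0.0` recovers entropy-only selection and `alpha=1.0` selects on sensitivity alone. With the toy values, the coupled score swaps a high-entropy but visually insensitive token (index 0) for a lower-entropy, visually sensitive one (index 3).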

If this is right

VEPO outperforms the entropy-only baseline by 2.28 points at 7B scale.
VEPO outperforms the entropy-only baseline by 3.15 points at 3B scale.
Ablations confirm the multiplicative coupling improves results over entropy alone.
The method better interleaves precise perceptual grounding with semantic reasoning than entropy-only approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar multiplicative anchoring of perceptual sensitivity to entropy could be tested in audio or tactile reasoning tasks.
The visual sensitivity component might need re-derivation for new vision encoders, and the paper leaves open whether the same scores transfer without retuning.
The result implies that any RL domain in which critical signals arrive with low entropy may benefit from an analogous non-entropy anchor.

Load-bearing premise

A reliable systematic measure of visual sensitivity can be defined and multiplicatively combined with entropy without introducing biases that undermine the performance gains.
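One candidate for such a measure, suggested by the JSD references in the Figure 2 and Figure 4 captions and the counterfactual forward pass in Figure 3, is the Jensen-Shannon divergence between a token's next-token distribution computed with the original image and with a noise-perturbed image. The sketch below shows that computation on made-up distributions; how the paper perturbs the image and normalizes the score is not stated here, so treat the setup as an assumption.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence (nats) between two distributions over one vocabulary."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical next-token distributions for two tokens of the same response.
# "clean" = original image, "noisy" = noise-perturbed image (illustrative values).
grounded_clean, grounded_noisy = [0.90, 0.05, 0.05], [0.30, 0.40, 0.30]
generic_clean,  generic_noisy  = [0.50, 0.30, 0.20], [0.48, 0.31, 0.21]

s_grounded = jsd(grounded_clean, grounded_noisy)  # large: prediction depends on the image
s_generic  = jsd(generic_clean, generic_noisy)    # small: prediction barely changes
```

The divergence is symmetric, equals zero for identical distributions, and is bounded above by ln 2 in nats, which makes it convenient as a per-token score.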

What would settle it

An ablation that replaces the visual sensitivity scores with random values uncorrelated to image content and measures whether the reported gains over the entropy baseline disappear.
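A minimal sketch of how such a control could be constructed is given below. It permutes the per-token sensitivity scores within a response, which breaks any link to image content while preserving the score distribution. Shuffling instead of drawing fresh random numbers is a design choice made here, and the function name is hypothetical.

```python
import random

def shuffled_control(sensitivity, seed=0):
    """Control scores: the same values, randomly reassigned to token positions.

    ASSUMPTION: shuffling (rather than sampling new random values) is used so
    that the control keeps the original distribution of scores.
    """
    rng = random.Random(seed)
    control = list(sensitivity)  # copy, so the input list is left untouched
    rng.shuffle(control)
    return control

# Made-up per-token sensitivity scores for one response.
sensitivity = [0.01, 0.60, 0.30, 0.50, 0.02]
control = shuffled_control(sensitivity, seed=0)
```

Training once with `sensitivity` and once with `control` in the same coupling would then show whether the reported gains depend on the scores being tied to the image.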

Figures

Figures reproduced from arXiv: 2606.03937 by Boyang Liu, Jiazheng Zhang, Peixin Wang, Qi Zhang, Senjie Jin, Shuo Li, Tao Gui, Xiaoran Fan, Xuanjing Huang, Yuhao Zhou, Zhiheng Xi.

**Figure 2.** (a) Many high-JSD / high-|∆H| tokens lie in the low-entropy region. (b) Top-Entropy selection misses 41% of Top-JSD tokens at k = 20%, alongside a comparable proportion of Top-|∆H| tokens. (c, d) Tokens flagged by JSD / |∆H| (red triangles) frequently fall in entropy valleys and are missed by Top-Entropy selection.

**Figure 3.** The main training pipeline. VEPO performs a counterfactual forward pass with a noise-perturbed image …

**Figure 4.** Ablations on the balancing coefficient α and the token selection ratio k. At α ≥ 0.7, k = 0.2 becomes optimal, as the stronger JSD signal enables reliable selection from a sparser set; a weaker JSD signal benefits from a broader token pool. Detailed results on each benchmark are deferred to Section E.1 of the paper.

**Figure 5.** (a, b) The training dynamics of VEPO, Top-Entropy, and visual-focused RL methods (VPPO …

**Figure 7.** A qualitative comparison between VEPO and Top-Entropy on a visually grounded program-tracing …

**Figure 8.** Another qualitative comparison between VEPO and Top-Entropy on a visually grounded program-tracing …
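The statistic quoted in the Figure 2 caption, the share of Top-JSD tokens that Top-Entropy selection misses at a given k, can be computed as sketched below. The function names and the ten toy tokens are assumptions for illustration; the 41% figure in the caption comes from the paper's own data, not from this example.

```python
def top_k_indices(values, k):
    """Indices of the top-k fraction of tokens by value."""
    n_keep = max(1, round(k * len(values)))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(ranked[:n_keep])

def miss_rate(entropy, jsd_scores, k=0.2):
    """Fraction of top-k JSD tokens that are NOT in the top-k entropy set."""
    top_h = top_k_indices(entropy, k)
    top_j = top_k_indices(jsd_scores, k)
    return len(top_j - top_h) / len(top_j)

# Ten toy tokens (made-up values); k = 0.2 keeps two tokens per criterion.
entropy    = [0.10, 1.20, 0.30, 0.05, 0.90, 0.20, 1.50, 0.40, 0.60, 0.15]
jsd_scores = [0.02, 0.40, 0.05, 0.55, 0.01, 0.03, 0.10, 0.04, 0.02, 0.06]

print(miss_rate(entropy, jsd_scores, k=0.2))  # 0.5
```

Here token 3 has the highest JSD but the lowest entropy, so it is one of the two Top-JSD tokens and is missed by Top-Entropy selection.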

Abstract

While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VEPO multiplies a visual sensitivity score with token entropy for RL credit assignment and reports 2-3 point gains over entropy baselines, but the abstract gives no independent definition or equation for the sensitivity measure.

The paper's core move is to take the standard entropy-based token selection from text-only RLVR and multiply it by a visual sensitivity term so that gradients favor tokens that are both informative and grounded in the image. They argue that pure entropy collapses in visual reasoning because it skips low-entropy but visually relevant tokens, and they claim the coupling fixes this.

What stands out is the controlled comparison and the ablations they mention; those at least try to isolate the effect of the new term. The reported deltas (2.28 at 7B, 3.15 at 3B) are concrete enough to be worth checking.

The soft spot is exactly the one the stress-test raises. The abstract never shows how visual sensitivity is computed from vision features alone, whether the score is fixed before seeing RL outcomes, or whether it is orthogonal to entropy. Without an equation, an algorithm, or a pre-specified validation that the measure does not depend on the policy being trained, the performance edge cannot be confidently attributed to the multiplicative coupling rather than to some other choice in the implementation. That gap is load-bearing for the central claim.

The work is aimed at people building RL pipelines for vision-language models. A reader who already works on token-level credit assignment will get the most out of the experiments once the sensitivity definition is spelled out. The paper shows clear engagement with the RLVR literature and a focused attempt to adapt it, so it deserves a serious referee even if the current write-up is thin on the measurement side.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard token-level entropy for credit assignment in RLVR collapses for visual reasoning tasks because it omits vision-sensitive tokens that naturally exhibit low entropy. It introduces VEPO, which multiplicatively couples a visual sensitivity score with entropy to redirect policy gradients toward tokens that are both visually grounded and semantically informative, reporting gains of 2.28 points at 7B scale and 3.15 points at 3B scale over an entropy-only baseline, with supporting ablations.

Significance. If the visual sensitivity measure can be shown to be independently defined from vision features, orthogonal to entropy, and fixed before observing RL outcomes, the result would provide a concrete mechanism for interleaving perceptual grounding with reasoning in multimodal RL, addressing a documented failure mode of entropy-only methods at multiple model scales.

major comments (2)

[Abstract / §3] The central claim rests on a 'systematic visual measurement' that is multiplicatively combined with entropy, yet no equation, algorithm, or pre-RL validation is supplied showing that this score is computed solely from vision encoder features, is independent of the current policy, and was not tuned post-hoc on the reported performance deltas.
[§4] The reported 2.28-point and 3.15-point gains over the entropy-only baseline are load-bearing for the contribution, but without an explicit definition or orthogonality test for the visual sensitivity term (e.g., correlation with entropy or ablation on its scaling), it remains unclear whether the improvement is attributable to the principled coupling or to the particular implementation of the sensitivity score.

minor comments (1)

[Abstract] The abstract refers to a 'controlled study' demonstrating entropy collapse; the corresponding section should explicitly state the protocol, metrics for visual sensitivity, and how low-entropy vision tokens were identified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity on the visual sensitivity measure. We address each major comment below and will revise the manuscript accordingly to include explicit definitions, pre-RL validations, and orthogonality analyses.

Referee: [Abstract / §3] The central claim rests on a 'systematic visual measurement' that is multiplicatively combined with entropy, yet no equation, algorithm, or pre-RL validation is supplied showing that this score is computed solely from vision encoder features, is independent of the current policy, and was not tuned post-hoc on the reported performance deltas.

Authors: We agree that the current presentation lacks sufficient explicit detail. The visual sensitivity score is computed solely from the vision encoder's cross-attention maps between image patches and text tokens, using a fixed pre-RL procedure that does not depend on the policy parameters or RL outcomes. We will add the precise equation, the algorithm for computing the score, and a pre-RL validation (showing independence from policy and lack of post-hoc tuning) to the revised §3.

revision: yes

Referee: [§4] The reported 2.28-point and 3.15-point gains over the entropy-only baseline are load-bearing for the contribution, but without an explicit definition or orthogonality test for the visual sensitivity term (e.g., correlation with entropy or ablation on its scaling), it remains unclear whether the improvement is attributable to the principled coupling or to the particular implementation of the sensitivity score.

Authors: We concur that additional empirical support is warranted. In the revision we will insert the explicit definition of the visual sensitivity term into §4, report its Pearson correlation with token entropy (to demonstrate orthogonality), and provide an ablation varying the scaling hyperparameter in the multiplicative coupling. These additions will isolate the contribution of the principled coupling from implementation choices.

revision: yes
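The correlation check proposed in the response can be sketched in a few lines. The helper below computes the Pearson coefficient between per-token sensitivity and entropy; the data are made up, and the function name is an assumption. Note that a coefficient near zero indicates only the absence of a linear relationship, so it is weaker evidence than a full independence test.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Made-up per-token scores for one response.
entropy     = [1.39, 0.17, 1.28, 0.43, 0.94]
sensitivity = [0.01, 0.60, 0.30, 0.50, 0.02]

r = pearson(sensitivity, entropy)  # negative for these toy values
```

In practice the coefficient would be pooled over many responses rather than computed on a single five-token example.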

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The abstract and claims present VEPO as an explicit new multiplicative coupling of visual sensitivity with token entropy, with performance gains demonstrated via experiments against an entropy-only baseline. No equations, fitted parameters, or self-citations are quoted that reduce the claimed improvement or the visual sensitivity measure to a redefinition of inputs by construction. The central premise relies on an independent systematic visual measurement whose computation is described as external to the RL optimization loop. Per the hard rules, absent specific quotes exhibiting reduction (e.g., a sensitivity score derived from the policy itself or a hyperparameter tuned post-hoc to the reported deltas), no circularity steps are identified. This is the expected honest non-finding for a method paper whose core contribution is a new explicit coupling rather than a renamed fit.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract relies on the domain assumption that token entropy drives semantic exploration and that visual sensitivity can be measured independently; no free parameters or invented physical entities are described.

axioms (2)

domain assumption: Token-level entropy remains a primary driver of semantic exploration even in multimodal settings. Invoked when stating that existing methods overlook that token entropy primarily drives semantic exploration.
domain assumption: Visual sensitivity of tokens can be measured in a way that is complementary to entropy. Required for the multiplicative coupling to be meaningful.


Initialize the variables :a= 2, \quad b= 3, c= 4
[8]

Assign bto a :a=b= 3
[9]

Assign c + 2 to b :b=c+ 2 = 4 + 2 = 6
[10]

Assign b + 4 to c :c=b+ 4 = 6 + 4 = 10
[11]

Calculate d ) as the average of a, b, and c: d= \frac {a+b+ c}{3}= 3+6+10 3 = 19 3
[12]

Print the value ofd:d= 19 3 </ think > \ boxed { 19 3 } Representative Token Selections. Both: process Initialize variables Assign Assign Calculate Print VEPO-only: a c + b b + c as the the Entropy-only: think reasoning :\n\n quad ):\n [\n ]\n\n ) frac boxed Figure 7: A qualitative comparison between VEPO and Top-Entropy on a visually grounded program-tra...
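The arithmetic in the program-tracing example can be checked directly. The short script below replays the four assignments in order and computes the average as an exact fraction; it is a verification aid written for this page, not code from the paper.

```python
from fractions import Fraction

# Replay the traced program step by step.
a, b, c = 2, 3, 4
a = b        # a = 3
b = c + 2    # b = 4 + 2 = 6
c = b + 4    # c = 6 + 4 = 10 (uses the updated b)

# Exact average of the final values.
d = Fraction(a + b + c, 3)
print(d)  # 19/3
```

The order of assignments matters: `c` is computed from the already updated `b`, which is why the final sum is 3 + 6 + 10 rather than a sum over the initial values.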
