Perception-Aware Policy Optimization for Multimodal Reasoning
Pith reviewed 2026-05-19 05:08 UTC · model grok-4.3
The pith
A KL divergence term added to RL algorithms enables multimodal models to perceive visuals better while reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAPO is a novel policy gradient algorithm that incorporates an Implicit Perception Loss in the form of a KL divergence term to encourage the model to learn perception alongside reasoning in multimodal tasks. When combined with the Double Entropy Loss for training stability, it delivers improvements of 4.4% to 17.5% on various benchmarks, with larger gains on high vision-dependency tasks, and reduces perception errors by 30.5%.
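The stabilizing role of an entropy term can be illustrated with a small sketch. It assumes, purely for illustration, that "double" means the entropy is taken over two sets of next-token distributions (the names `probs_a` and `probs_b` are hypothetical); this page does not define the Double Entropy Loss, so the pairing below is a guess at its shape rather than the paper's formula.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def double_entropy_penalty(probs_a, probs_b):
    """Sum of the mean entropies of two sets of next-token distributions.

    ASSUMPTION: which two policies feed probs_a and probs_b is not stated
    on this page; both arguments are hypothetical placeholders.
    """
    mean_a = sum(entropy(p) for p in probs_a) / len(probs_a)
    mean_b = sum(entropy(p) for p in probs_b) / len(probs_b)
    return mean_a + mean_b

# Hypothetical two-token vocabulary.
confident = [[1.0, 0.0]]   # zero entropy
uncertain = [[0.5, 0.5]]   # maximal entropy for two outcomes
low_penalty = double_entropy_penalty(confident, confident)
high_penalty = double_entropy_penalty(uncertain, uncertain)
```

Penalizing this quantity would discourage either distribution from drifting toward near-uniform outputs, which is one plausible way an added KL objective could be kept in check.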
What carries the argument
The Implicit Perception Loss implemented as a KL divergence term that integrates perception supervision directly into the RL objective.
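This page does not say which two distributions the KL term compares. As a minimal sketch, assume the term contrasts the model's next-token distribution given the original image with its distribution given a corrupted copy of that image. The pairing and every name below are illustrative assumptions, not the paper's definition.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def implicit_perception_term(probs_full, probs_masked):
    """Mean per-token KL between two sets of next-token distributions.

    ASSUMPTION: probs_full[t] is conditioned on the original image and
    probs_masked[t] on a corrupted copy; this pairing is hypothetical.
    """
    per_token = [kl_divergence(p, q) for p, q in zip(probs_full, probs_masked)]
    return sum(per_token) / len(per_token)

# Hypothetical 3-token vocabulary, two generation steps.
full = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
masked_same = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # output ignores the image
masked_diff = [[0.3, 0.4, 0.3], [0.4, 0.3, 0.3]]   # output depends on the image

ignores_image = implicit_perception_term(full, masked_same)
uses_image = implicit_perception_term(full, masked_diff)
```

Under this reading, a model whose outputs do not change when the image is corrupted scores zero, so rewarding a larger value would push the policy to rely on visual content.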
If this is right
- Multimodal reasoning benchmarks see gains between 4.4 and 17.5 percent.
- Tasks relying more on vision improve by 8.0 to 19.1 percent.
- Perception errors drop by 30.5 percent across evaluations.
- The method works as a plug-in for existing RLVR algorithms without needing curated data or reward models.
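The percentages above can be read as relative changes or as percentage-point differences, and this page does not say which. The sketch below shows how far the two readings diverge, using made-up scores that are not taken from the paper.

```python
def absolute_gain(baseline, improved):
    """Gain in percentage points."""
    return improved - baseline

def relative_gain(baseline, improved):
    """Gain as a percentage of the baseline score."""
    return 100.0 * (improved - baseline) / baseline

# Hypothetical accuracies in percent, not taken from the paper.
baseline_acc, new_acc = 50.0, 55.0
abs_pts = absolute_gain(baseline_acc, new_acc)
rel_pct = relative_gain(baseline_acc, new_acc)
```

With these made-up numbers the same change is 5 points in absolute terms and 10 percent in relative terms, so the reporting convention matters when comparing against other methods.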
Where Pith is reading between the lines
- This suggests that perception errors can be targeted directly in the training objective rather than through post-processing or additional supervision.
- Similar KL-based terms might help in purely textual reasoning tasks if adapted to other error types.
- Future experiments could apply PAPO to different model architectures to see if the perception benefits hold beyond the tested models.
Load-bearing premise
That the benefits come specifically from improved perception due to the KL term rather than from general regularization effects that any added loss term might provide.
What would settle it
A controlled experiment in which the KL divergence term is replaced by an equivalent-strength regularizer that does not target perception would falsify the claim if it produced similar error reductions and performance gains.
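One way to lay out such a control is sketched below. The arm names, the `margin` threshold, and the error rates are all hypothetical; the sketch only encodes the comparison logic, not any experiment the paper ran.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arm:
    name: str
    perception_kl: bool   # KL term aimed at visual perception
    generic_reg: bool     # matched-strength regularizer with no visual target

# Hypothetical experimental arms.
ARMS = [
    Arm("baseline", False, False),
    Arm("generic_regularizer", False, True),
    Arm("perception_kl", True, False),
]

def perception_claim_survives(error_rates, margin=0.0):
    """True if the perception arm beats the generic regularizer arm on
    perception error rate by more than `margin`."""
    gap = error_rates["generic_regularizer"] - error_rates["perception_kl"]
    return gap > margin

# Made-up perception error rates for two possible outcomes.
outcome_supports = {"baseline": 0.40, "generic_regularizer": 0.39, "perception_kl": 0.28}
outcome_refutes = {"baseline": 0.40, "generic_regularizer": 0.28, "perception_kl": 0.28}
```

In the second outcome the generic regularizer matches the perception term, which is the pattern that would undercut the perception-specific explanation.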
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
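To make the "plugged into GRPO" idea concrete, here is a simplified sketch of a group-normalized advantage with one extra weighted term added to the surrogate. It omits ratio clipping and any reference-model penalty. The names `extra_terms` and `weight`, and the choice to add rather than subtract the extra term, are assumptions for illustration.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def augmented_objective(ratios, advantages, extra_terms, weight):
    """Surrogate to maximize: mean(ratio * advantage) + weight * mean(extra term).

    ASSUMPTION: `extra_terms` holds one auxiliary value per rollout (for
    example a perception-related KL); its sign and weight are hypothetical.
    """
    policy_part = statistics.fmean([r * a for r, a in zip(ratios, advantages)])
    bonus = statistics.fmean(extra_terms)
    return policy_part + weight * bonus

# Hypothetical group of four rollouts with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
objective = augmented_objective(
    ratios=[1.0, 1.0, 1.0, 1.0],
    advantages=advantages,
    extra_terms=[0.2, 0.1, 0.3, 0.2],
    weight=0.01,
)
```

Because the extra term enters as a separate weighted sum, the reward function and the rollout procedure stay unchanged, which is consistent with the abstract's claim that no reward model or additional data is needed.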
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Perception-Aware Policy Optimization (PAPO), a policy gradient method that augments RLVR algorithms such as GRPO and DAPO with an Implicit Perception Loss (KL divergence term) to encourage perceptual awareness during multimodal reasoning, plus a Double Entropy Loss for training stability. It reports overall gains of 4.4%-17.5% on multimodal benchmarks, larger gains of 8.0%-19.1% on high-vision-dependency tasks, and a 30.5% reduction in perception errors, without requiring extra data, reward models, or teacher models.
Significance. If the attribution of gains to the perception-specific loss holds, the work provides a simple, data-efficient way to address perception bottlenecks in multimodal RLVR and could support more grounded reasoning in vision-language models. The public code and data release would strengthen reproducibility.
major comments (2)
- [Experiments] Experiments section: The central claim attributes the 30.5% perception-error reduction and the larger gains (8.0%-19.1%) on high-vision-dependency tasks specifically to the Implicit Perception Loss. However, the manuscript provides no ablations that isolate this KL term against generic regularization controls (e.g., a non-perception KL divergence or the Double Entropy Loss alone). Without such controls, the improvements cannot be securely distinguished from generic stabilization effects of the modified objective.
- [Method] Method section: The Implicit Perception Loss is defined as a KL divergence term intended to encourage perception while reasoning, yet the target distribution for the KL and the precise mechanism by which it affects visual tokens (as opposed to general output regularization) are not specified. This leaves the perception-specific interpretation under-supported.
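For reference, the referee's second point turns on the standard definition of KL divergence between two discrete distributions, written here in generic symbols $P$ and $Q$ that do not come from the manuscript:

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}
```

The quantity is asymmetric in $P$ and $Q$ and depends on what each distribution is conditioned on, so a full specification needs to name both distributions, their order, and whether the term is minimized or maximized.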
minor comments (2)
- [Abstract] Abstract: The phrase 'diverse multimodal benchmarks' is used without naming the specific datasets or tasks; adding the list would improve immediate clarity for readers.
- [Experiments] How perception errors are quantified and measured (e.g., error categorization protocol or annotation process) is referenced in the results; a brief methods paragraph or appendix describing it would aid full reproducibility.
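As an illustration of the kind of protocol the last comment asks for, the sketch below tallies error categories from per-example annotations. The category names and labels are hypothetical and do not reflect the paper's taxonomy or data.

```python
from collections import Counter

def error_rates(labels):
    """Share of each error category among annotated failures."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Hypothetical annotations for four failed examples.
annotations = ["perception", "reasoning", "perception", "calculation"]
rates = error_rates(annotations)
```

Reporting such shares before and after training, together with who or what produced the labels, would let readers check how a figure like the 30.5% reduction was derived.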
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We respond to each major comment below and indicate planned revisions to address the concerns raised.
- Referee: [Experiments] Experiments section: The central claim attributes the 30.5% perception-error reduction and the larger gains (8.0%-19.1%) on high-vision-dependency tasks specifically to the Implicit Perception Loss. However, the manuscript provides no ablations that isolate this KL term against generic regularization controls (e.g., a non-perception KL divergence or the Double Entropy Loss alone). Without such controls, the improvements cannot be securely distinguished from generic stabilization effects of the modified objective.
  Authors: We agree that the current experiments do not fully isolate the contribution of the Implicit Perception Loss from possible generic regularization effects. In the revised manuscript we will add the requested ablations, including a non-perception KL divergence baseline and an ablation using only the Double Entropy Loss, to better attribute the observed gains and perception-error reductions.
  Revision: yes
- Referee: [Method] Method section: The Implicit Perception Loss is defined as a KL divergence term intended to encourage perception while reasoning, yet the target distribution for the KL and the precise mechanism by which it affects visual tokens (as opposed to general output regularization) are not specified. This leaves the perception-specific interpretation under-supported.
  Authors: We acknowledge that the manuscript would benefit from a more explicit definition of the target distribution in the KL term and a clearer description of its selective effect on visual tokens. We will revise the Method section to provide these details and strengthen the support for the perception-aware interpretation.
  Revision: yes
Circularity Check
No significant circularity in the proposed method or empirical claims
The paper introduces PAPO as a policy gradient algorithm by adding an Implicit Perception Loss (KL divergence term) and Double Entropy Loss to existing RLVR methods such as GRPO and DAPO. These components are presented as independent design choices to encourage perception-aware reasoning, with no equations or derivations that reduce the claimed outcomes to the inputs by construction. Reported gains (4.4%-17.5% overall, 8.0%-19.1% on high-vision tasks, 30.5% perception error reduction) are measured on external multimodal benchmarks rather than being fitted parameters or self-derived quantities. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the central claims. The approach is self-contained as an algorithmic proposal with external validation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Perception of visual inputs is a major source of error in multimodal reasoning tasks
- ad hoc to paper: A KL divergence term can be introduced to encourage perception while learning to reason
Forward citations
Cited by 19 Pith papers
- Visual-Advantage On-Policy Distillation for Vision-Language Models
  VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
- Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
  RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
  Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
  SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
- V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
  V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
- Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
  Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.
- Reinforcing Multimodal Reasoning Against Visual Degradation
  ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
- MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Languate Models
  MHPR is a multidimensional benchmark for LVLM human-centric perception-reasoning with C-RD, SFT-D, RL-D, T-D data tiers and ACVG pipeline, showing training gains on Qwen2.5-VL-7B to near-parity with larger models.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
- MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
  MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
- MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
  MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
- Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
  Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained catego...
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
  Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
- Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
  Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
- Rethinking Agentic Reinforcement Learning In Large Language Models
  The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
- Rethinking Agentic Reinforcement Learning In Large Language Models
  The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
- Rethinking Agentic Reinforcement Learning In Large Language Models
  This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.