arxiv: 2507.06448 · v5 · submitted 2025-07-08 · 💻 cs.CL

Perception-Aware Policy Optimization for Multimodal Reasoning

Pith reviewed 2026-05-19 05:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal reasoning · policy gradient · perception-aware learning · KL divergence · reinforcement learning with verifiable rewards · vision-language models · error reduction

The pith

A KL divergence term added to RL algorithms enables multimodal models to perceive visuals better while reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop PAPO to fix a key problem in multimodal reasoning where models often fail at perceiving the visual input correctly. They add an implicit perception loss using KL divergence to the standard reinforcement learning with verifiable rewards setup, along with a double entropy loss for stability. This change requires no new data or models and can be added to popular methods like GRPO. If effective, it shows that perception can be trained jointly with reasoning through a simple modification to the objective function, leading to more reliable performance on tasks that depend heavily on understanding images.

Core claim

PAPO is a novel policy gradient algorithm that incorporates an Implicit Perception Loss in the form of a KL divergence term to encourage the model to learn perception alongside reasoning in multimodal tasks. When combined with the Double Entropy Loss for training stability, it delivers improvements of 4.4% to 17.5% on various benchmarks, with larger gains on high vision-dependency tasks, and reduces perception errors by 30.5%.
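
One compact way to read this claim is as an augmented objective. The form below is a sketch, not the paper's stated equation: the weight γ appears in the figure captions, while the entropy coefficients η_1 and η_2, the sign convention on the entropy terms, and the direction of the KL are assumptions made here for illustration. Here q is the question, I the image, I_mask a masked copy of the image, and o the sampled response.

```latex
\[
\mathcal{J}_{\mathrm{PAPO}}(\theta)
  = \mathcal{J}_{\mathrm{GRPO}}(\theta)
  + \gamma \, \mathrm{KL}_{\mathrm{prcp}}
  - \eta_1 \, \mathcal{H}[\pi_\theta]
  - \eta_2 \, \mathcal{H}[\pi_\theta^{\mathrm{mask}}],
\qquad
\mathrm{KL}_{\mathrm{prcp}}
  = D_{\mathrm{KL}}\!\left(
      \pi_\theta(o \mid q, I) \,\middle\|\, \pi_\theta(o \mid q, I_{\mathrm{mask}})
    \right)
\]
```

Under this reading, maximizing the objective pushes the policy's outputs to depend on the unmasked image, while the two entropy terms limit how far the KL term can be inflated.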

What carries the argument

The Implicit Perception Loss implemented as a KL divergence term that integrates perception supervision directly into the RL objective.
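
To make the KL term concrete, here is a minimal sketch in plain Python. It assumes the term compares, token by token, the policy's next-token distribution given the original image with the distribution given a masked image, as the Figure 2 caption describes. The function names, the KL direction, the full-vocabulary exact KL (rather than a sampled estimator), and the mean over tokens are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def implicit_perception_term(logits_full, logits_masked):
    """Mean per-token KL between the policy conditioned on the original
    image (logits_full) and on a masked image (logits_masked).

    Both arguments are lists of per-token logit vectors for the same
    response. Hypothetical helper: the direction of the KL and the
    averaging over tokens are assumptions made for this sketch.
    """
    if len(logits_full) != len(logits_masked):
        raise ValueError("both inputs must cover the same response tokens")
    per_token = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_full, logits_masked)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary for a 2-token response.
    with_image = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    with_masked_image = [[0.0, 0.0, 0.0], [0.3, 0.2, 0.1]]
    print(implicit_perception_term(with_image, with_masked_image))
```

A value near zero means the response is equally likely with or without the image, which is the behavior the perception term is meant to discourage.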

If this is right

  • Multimodal reasoning benchmarks see gains between 4.4 and 17.5 percent.
  • Tasks relying more on vision improve by 8.0 to 19.1 percent.
  • Perception errors drop by 30.5 percent across evaluations.
  • The method works as a plug-in for existing RLVR algorithms without needing curated data or reward models.
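
For context on the plug-in point, the sketch below shows the group-relative advantage that GRPO computes from verifiable rewards alone, with no learned reward model. It assumes the perception term is added to the objective and leaves this reward and advantage computation as in the base algorithm. The function name and the use of the population standard deviation are illustrative choices, not taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of sampled responses to the
    same prompt: subtract the group mean and divide by the group
    standard deviation.

    Population standard deviation is an assumption here; implementations
    differ on this detail. eps guards against a zero-variance group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four responses to one prompt, scored 1.0 if verifiably correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the advantages depend only on rule-checked rewards within a group, an extra loss term can be attached without curated data or a reward model.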

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that perception errors can be targeted directly in the training objective rather than through post-processing or additional supervision.
  • Similar KL-based terms might help in purely textual reasoning tasks if adapted to other error types.
  • Future experiments could apply PAPO to different model architectures to see if the perception benefits hold beyond the tested LLMs.

Load-bearing premise

That the benefits come specifically from improved perception due to the KL term rather than from general regularization effects that any added loss term might provide.

What would settle it

A controlled experiment where the KL divergence is replaced by an equivalent-strength regularizer that does not target perception would falsify the claim if it produces similar error reductions and performance gains.

Figures

Figures reproduced from arXiv: 2507.06448 by Fei Huang, Haiyang Xu, Heng Ji, Hongru Wang, Hyeonjeong Ha, Ming Yan, Sofia Stoica, Xiusi Chen, Xuehang Guo, Yangyi Chen, Zhenhailong Wang.

Figure 1: Comprehensive error-type breakdown and inference example between GRPO and PAPO. We observe that perception errors account for the majority (67%) of failures in current multimodal reasoning models trained with GRPO. PAPO significantly reduces the dominant perception-driven errors by 30.5%, with the reduced portion indicated in gray. On the right, we present a representative inference example that illustr…
Figure 2: Illustration of the PAPO_G objective, which extends GRPO by adding the Implicit Perception Loss (KL_prcp). Additional Double Entropy Loss regularization (H[π_θ], H[π_θ^mask]) can be added for enhancing training stabilities. The KL_prcp is formulated as maximizing the difference between the original policy π_θ and a corrupted policy π_θ^mask, computed with a masked visual input. Intuitively, PAPO encourages th…
Figure 3: Comparison of the training dynamics on the accuracy reward. Solid lines indicate running averages with a stepping window size of 20. PAPO demonstrates consistently faster learning from the early stages on both GRPO and DAPO. Notably, DAPO-7B suffers from model collapse in the later stages, whereas PAPO_D achieves continued improvements without collapse, highlighting the effectiveness of the proposed Double …
Figure 4: Early signs of model collapsing due to KL_prcp Hacking. The “No Collapse” and “Collapsed” models refer to PAPO_G-7B (γ = 0.01) and PAPO_G-7B (γ = 0.02 without double entropy regularization), respectively. When collapsing occurs, we notice (a-b) the Implicit Perception Loss drops drastically, accompanied by a collapsing training reward, (c) the clipping ratio-high continuously increases, which indicates the pr…
Figure 5: Influential factors towards KL_prcp Hacking. We identify three main factors: (a) KL_prcp weighting (higher values lead to a greater likelihood of collapse); (b) size (the larger the model, the more likely it is to collapse); (c) an extreme masking ratio (e.g., 1.0) results in a faster collapse.
Figure 6: Comparison of different regularization strategies. All strategies are applied to the same collapsing baseline, PAPO_G (γ = 0.02, no regularization). Among the four methods described in the main text, three successfully prevent the collapse entirely, while adding Single Masked Entropy only delays it. The proposed Double Entropy Loss demonstrates the best training dynamics and prevents the collapse.
Figure 7: Illustrative examples of different levels of vision-dependency.
Figure 8: Visualization of different masking strategies. Semantic-aware masking prioritizes patches…
Figure 9: Training dynamics of DAPO baseline with entropy loss. Adding entropy loss to the DAPO-7B baseline delays collapse but still results in training instability. In contrast, PAPO_D with Double Entropy Loss maintains stable training throughout and achieves superior performance, demonstrating the effectiveness of perception-aware optimization combined with robust regularization.
Figure 10: Impact of KL_prcp weighting (γ) under settings without reference KL. Double Entropy Loss is indispensable for stabilizing training in this setting. Due to inherently weaker regularization, γ should be set to a smaller value. When set higher (e.g., 0.02), model collapse still occurs, even with Double Entropy Loss.
Figure 11: Collapsing behavior. A distinctive generation pattern in collapsed models is the production of irrelevant tokens. We verify this quantitatively by prompting GPT-4.1-mini (OpenAI, 2025) to provide relatedness scores of the responses from 0 to 10 for GRPO and collapsed PAPO_G-7B (γ = 0.02, no regularization) model. We further compare the variance of KL_prcp over the response tokens. As illustrated, the collap…
Figure 12: Prompt to GPT-4.1-mini for scoring the relatedness between the model-generated response…
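
The captions above refer to a masked visual input and a masking ratio. As a minimal sketch of what such a corruption step could look like, the Python below blanks a random fraction of image patches. Random patch selection, the zero fill value, and the function name are assumptions for illustration; the captions also mention a semantic-aware masking strategy, which this sketch does not implement.

```python
import random


def mask_patches(patches, mask_ratio, seed=None, fill=0.0):
    """Return (masked_patches, masked_indices).

    patches: list of patch vectors (lists of floats).
    mask_ratio: fraction of patches to blank out, between 0.0 and 1.0.
    A ratio of 1.0 removes all visual content, the extreme case that
    the Figure 5 caption associates with faster collapse.
    Hypothetical helper, not the paper's masking code.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(patches))
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    masked = [
        [fill] * len(p) if i in masked_idx else list(p)
        for i, p in enumerate(patches)
    ]
    return masked, sorted(masked_idx)


if __name__ == "__main__":
    # Ten toy patches, each a 2-dimensional vector.
    toy = [[float(i), float(i) + 0.5] for i in range(10)]
    masked, idx = mask_patches(toy, mask_ratio=0.6, seed=0)
    print(len(idx), "of", len(toy), "patches masked")
```

The masked copy would then be fed through the same policy to obtain the corrupted distribution that the KL term compares against.
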
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Perception-Aware Policy Optimization (PAPO), a policy gradient method that augments RLVR algorithms such as GRPO and DAPO with an Implicit Perception Loss (KL divergence term) to encourage perceptual awareness during multimodal reasoning, plus a Double Entropy Loss for training stability. It reports overall gains of 4.4%-17.5% on multimodal benchmarks, larger gains of 8.0%-19.1% on high-vision-dependency tasks, and a 30.5% reduction in perception errors, without requiring extra data, reward models, or teacher models.

Significance. If the attribution of gains to the perception-specific loss holds, the work provides a simple, data-efficient way to address perception bottlenecks in multimodal RLVR and could support more grounded reasoning in vision-language models. The public code and data release would strengthen reproducibility.

major comments (2)
  1. [Experiments] Experiments section: The central claim attributes the 30.5% perception-error reduction and the larger gains (8.0%-19.1%) on high-vision-dependency tasks specifically to the Implicit Perception Loss. However, the manuscript provides no ablations that isolate this KL term against generic regularization controls (e.g., a non-perception KL divergence or the Double Entropy Loss alone). Without such controls, the improvements cannot be securely distinguished from generic stabilization effects of the modified objective.
  2. [Method] Method section: The Implicit Perception Loss is defined as a KL divergence term intended to encourage perception while reasoning, yet the target distribution for the KL and the precise mechanism by which it affects visual tokens (as opposed to general output regularization) are not specified. This leaves the perception-specific interpretation under-supported.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'diverse multimodal benchmarks' is used without naming the specific datasets or tasks; adding the list would improve immediate clarity for readers.
  2. [Experiments] The description of how perception errors are quantified and measured (e.g., error categorization protocol or annotation process) is referenced in the results but would benefit from a brief methods paragraph or appendix for full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below and indicate planned revisions to address the concerns raised.

  1. Referee: [Experiments] Experiments section: The central claim attributes the 30.5% perception-error reduction and the larger gains (8.0%-19.1%) on high-vision-dependency tasks specifically to the Implicit Perception Loss. However, the manuscript provides no ablations that isolate this KL term against generic regularization controls (e.g., a non-perception KL divergence or the Double Entropy Loss alone). Without such controls, the improvements cannot be securely distinguished from generic stabilization effects of the modified objective.

    Authors: We agree that the current experiments do not fully isolate the contribution of the Implicit Perception Loss from possible generic regularization effects. In the revised manuscript we will add the requested ablations, including a non-perception KL divergence baseline and an ablation using only the Double Entropy Loss, to better attribute the observed gains and perception-error reductions. revision: yes

  2. Referee: [Method] Method section: The Implicit Perception Loss is defined as a KL divergence term intended to encourage perception while reasoning, yet the target distribution for the KL and the precise mechanism by which it affects visual tokens (as opposed to general output regularization) are not specified. This leaves the perception-specific interpretation under-supported.

    Authors: We acknowledge that the manuscript would benefit from a more explicit definition of the target distribution in the KL term and a clearer description of its selective effect on visual tokens. We will revise the Method section to provide these details and strengthen the support for the perception-aware interpretation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the proposed method or empirical claims

The paper introduces PAPO as a policy gradient algorithm by adding an Implicit Perception Loss (KL divergence term) and Double Entropy Loss to existing RLVR methods such as GRPO and DAPO. These components are presented as independent design choices to encourage perception-aware reasoning, with no equations or derivations that reduce the claimed outcomes to the inputs by construction. Reported gains (4.4%-17.5% overall, 8.0%-19.1% on high-vision tasks, 30.5% perception error reduction) are measured on external multimodal benchmarks rather than being fitted parameters or self-derived quantities. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the central claims. The approach is self-contained as an algorithmic proposal with external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that perception errors are the dominant failure mode in current multimodal reasoning and that a KL-based loss can directly encourage better perception without external supervision or data curation.

axioms (2)
  • domain assumption: Perception of visual inputs is a major source of error in multimodal reasoning tasks
    Explicitly stated as an observation in the abstract that motivates the method.
  • ad hoc to paper: A KL divergence term can be introduced to encourage perception while learning to reason
    The Implicit Perception Loss is defined in this form as a novel component of the algorithm.

pith-pipeline@v0.9.0 · 5860 in / 1422 out tokens · 47282 ms · 2026-05-19T05:08:43.233333+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual-Advantage On-Policy Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.

  2. Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

  3. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.

  4. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

  5. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  6. Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

    cs.CL 2026-01 unverdicted novelty 7.0

    Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.

  7. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  8. MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    MHPR is a multidimensional benchmark for LVLM human-centric perception-reasoning with C-RD, SFT-D, RL-D, T-D data tiers and ACVG pipeline, showing training gains on Qwen2.5-VL-7B to near-parity with larger models.

  9. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  10. MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

    cs.LG 2026-02 conditional novelty 6.0

    MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

  11. MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

    cs.LG 2026-02 unverdicted novelty 6.0

    MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.

  12. Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

    cs.CV 2026-02 unverdicted novelty 6.0

    Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained catego...

  13. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  14. Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

    cs.CV 2026-04 unverdicted novelty 5.0

    Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...

  15. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  16. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  17. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 3.0

    The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

  18. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

  19. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 13 Pith papers · 19 internal anchors

  3. [3]

    Chen, L.; Li, L.; Zhao, H.; Song, Y.; and Vinci. 2025. R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3. https://github.com/Deep-Agent/R1-V. Accessed: 2025-07-03

  4. [4]

    Fan, Y.; He, X.; Yang, D.; Zheng, K.; Kuo, C.-C.; Zheng, Y.; Jyothi, N., Sravana Guan; Guan, X.; and Wang, X. E. 2025. GRIT: Teaching MLLMs to Think with Images

  5. [5]

    Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  6. [6]

    hiyouga. 2025. MathRuler. https://github.com/hiyouga/MathRuler

  7. [7]

    Huang, W.; Jia, B.; Zhai, Z.; Cao, S.; Ye, Z.; Zhao, F.; Xu, Z.; Hu, Y.; and Lin, S. 2025. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. arXiv preprint arXiv:2503.06749

  8. [8]

    Li, S.; Deng, K.; Wang, L.; Yang, H.; Peng, C.; Yan, P.; Shen, F.; Shen, H. T.; and Xu, X. 2025a. Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning. arXiv preprint arXiv:2506.04755

  9. [9]

    Li, Y.; Wei, L.; Zheng, K.; Huang, J.; Kong, L.; Sun, L.; and Huang, W. 2025b. Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning. arXiv preprint arXiv:2506.09736

  10. [10]

    Li, Z.; Wang, X.; Stengel-Eskin, E.; Kortylewski, A.; Ma, W.; Van Durme, B.; and Yuille, A. 2023. Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14963--14973. IEEE

  11. [11]

    Liang, T.; Liu, X.; He, P.; Mi, H.; Tu, Z.; and Yu, D. 2025. MoDoMoDo: Learning Mixture-of-Datasets with Reinforcement Learning for Multimodal Reasoning. arXiv preprint arXiv:2505.24871

  12. [12]

    Liu, X.; Ni, J.; Wu, Z.; Du, C.; Dou, L.; Wang, H.; Pang, T.; and Shieh, M. Q. 2025a. Noisyrollout: Reinforcing visual reasoning with data augmentation. arXiv preprint arXiv:2504.13055

  13. [13]

    Liu, Y.; Qu, T.; Zhong, Z.; Peng, B.; Liu, S.; Yu, B.; and Jia, J. 2025b. VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.12081

  14. [14]

    Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.; Galley, M.; and Gao, J. 2023. MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models. arXiv preprint arXiv:2310.02255

  15. [15]

    Lu, P.; Gong, R.; Jiang, S.; Qiu, L.; Huang, S.; Liang, X.; and Zhu, S. 2021. Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. In Zong, C.; Xia, F.; Li, W.; and Navigli, R., eds., Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference...

  16. [16]

    Ma, Y.; Du, L.; Shen, X.; Chen, S.; Li, P.; Ren, Q.; Ma, L.; Dai, Y.; Liu, P.; and Yan, J. 2025. One RL to See Them All: Visual Triple Unified Reinforcement Learning. arXiv preprint arXiv:2505.18129

  17. [18]

    Meng, F.; Du, L.; Liu, Z.; Zhou, Z.; Lu, Q.; Fu, D.; Shi, B.; Wang, W.; He, J.; Zhang, K.; Luo, P.; Qiao, Y.; Zhang, Q.; and Shao, W. 2025b. MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning. arXiv preprint arXiv:2503.07365. Submitted March 10, 2025

  18. [19]

    OpenAI. 2025. Introducing GPT-4.1 in the API

  19. [20]

    Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H. V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El‑Nouby, A.; Assran, M.; Ballas, N.; Galuba, W.; Howes, R.; Huang, P.; Li, S.; Misra, I.; Rabbat, M.; Sharma, V.; Synnaeve, G.; Xu, H.; Jégou, H.; Mairal, J.; Labatut, P.; Joulin, A.; and Bojanowski, P. 2023. DINOv2: Learning Robust Visua...

  20. [21]

    Qiao, R.; Tan, Q.; Dong, G.; Wu, M.; Sun, C.; Song, X.; GongQue, Z.; Lei, S.; Wei, Z.; Zhang, M.; et al. 2024. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284

  21. [22]

    Qwen Team, A. G. 2024a. Qwen2.5-VL-3B-Instruct. https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct

  22. [23]

    Qwen Team, A. G. 2024b. Qwen2.5-VL-7B-Instruct. https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

  23. [24]

    Schulman, J. 2020. Approximating KL Divergence. http://joschu.net/blog/kl-approx.html

  24. [25]

    Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

  25. [26]

    Shannon, C. E. 1948. A mathematical theory of communication. The Bell system technical journal, 27(3): 379--423

  26. [27]

    Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.; Wu, Y.; et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  27. [28]

    Shen, H.; Liu, P.; Li, J.; Fang, C.; Ma, Y.; Liao, J.; Shen, Q.; Zhang, Z.; Zhao, K.; Zhang, Q.; et al. 2025a. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615

  28. [29]

    Shen, L.; Li, Y.; Mi, H.; Wang, W.; Tu, Z.; and Yu, D. 2025b. SATORI-R1: Spatially Anchored Training with Verifiable Rewards for Vision-Language Reasoning. arXiv preprint arXiv:2505.19094

  29. [30]

    Su, A.; Wang, H.; Ren, W.; Lin, F.; and Chen, W. 2025. Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning. arXiv preprint arXiv:2505.15966

  30. [31]

    Wan, Z.; Dou, Z.; Liu, C.; Zhang, Y.; Cui, D.; Zhao, Q.; Shen, H.; Xiong, J.; Xin, Y.; Jiang, Y.; et al. 2025. SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning. arXiv preprint arXiv:2506.01713

  31. [32]

    Wang, H.; Qu, C.; Huang, Z.; Chu, W.; Lin, F.; and Chen, W. 2025a. VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning. arXiv preprint arXiv:2504.08837. Published April 10, 2025

  32. [33]

    Wang, P.; Wei, Y.; Peng, Y.; Wang, X.; Qiu, W.; Shen, W.; Xie, T.; Pei, J.; Zhang, J.; Hao, Y.; Song, X.; Liu, Y.; and Zhou, Y. 2025b. Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning. arXiv preprint arXiv:2504.16656

  33. [34]

    Xia, J.; Zang, Y.; Gao, P.; Li, Y.; and Zhou, K. 2025. Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning. arXiv preprint arXiv:2505.14677

  34. [36]

    Xiao, W.; Zhang, W.; Hu, J.; Chen, R.; and Yang, J. 2025b. Perception-R1: A Reinforcement Learning Framework for Multimodal Perception with Verifiable Rewards. arXiv preprint arXiv:2506.07218

  35. [37]

    Xiao, Y.; Sun, E.; Liu, T.; and Wang, W. 2024. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973

  36. [38]

    Yang, Y.; He, X.; Pan, H.; Jiang, X.; Deng, Y.; Yang, X.; Lu, H.; Yin, D.; Rao, F.; Zhu, M.; et al. 2025. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615

  37. [39]

    Yao, H.; Yin, Q.; Zhang, J.; Yang, M.; Wang, Y.; Wu, W.; Su, F.; Shen, L.; Qiu, M.; Tao, D.; and Huang, J. 2025. R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

  38. [40]

    Yu, Q.; Zhang, Z.; Zhu, R.; Yuan, Y.; Zuo, X.; Yue, Y.; Dai, W.; Fan, T.; Liu, G.; Liu, L.; et al. 2025. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476

  39. [41]

    Yue, X.; Zheng, T.; Ni, Y.; Wang, Y.; Zhang, K.; Tong, S.; Sun, Y.; Yu, B.; Zhang, G.; Sun, H.; et al. 2024. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813

  40. [42]

    Zhang, R.; Jiang, D.; Zhang, Y.; Lin, H.; Guo, Z.; Qiu, P.; Zhou, A.; Lu, P.; Chang, K.; Gao, P.; and Li, H. 2024. MathVerse: Does Your Multi‑modal LLM Truly See the Diagrams in Visual Math Problems? CoRR, abs/2403.14624. Also published in the ECCV 2024 proceedings

  41. [43]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Zheng, Z.; Yang, M.; Hong, J.; Zhao, C.; Xu, G.; Yang, L.; Shen, C.; and Yu, X. 2025. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning. arXiv preprint arXiv:2505.14362

  42. [44]

    Zhu, J.; Wang, W.; Chen, Z.; Liu, Z.; Ye, S.; Gu, L.; Tian, H.; Duan, Y.; Su, W.; Shao, J.; et al. 2025a. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479

  43. [45]

    Zhu, M.; Zhong, H.; Zhao, C.; Du, Z.; Huang, Z.; Liu, M.; Chen, H.; Zou, C.; Chen, J.; Yang, M.; and Shen, C. 2025b. Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO. arXiv preprint arXiv:2505.21457