hub Canonical reference

Perception-Aware Policy Optimization for Multimodal Reasoning

· 2025 · cs.CL · arXiv 2507.06448

Canonical reference. 80% of citing Pith papers cite this work as background.

24 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 24 citing papers arXiv PDF

abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 baseline 1 method 1

citation-polarity summary

background 8 baseline 1 use method 1

representative citing papers

Visual-Advantage On-Policy Distillation for Vision-Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 7.0 · 2 refs

SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

cs.CV · 2026-03-31 · unverdicted · novelty 7.0

V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.

Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

cs.CL · 2026-01-11 · unverdicted · novelty 7.0

Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

CFPO is a counterfactual policy optimization method that regularizes RL policies in LVLMs by maximizing prediction discrepancy under suppressed visual cues, reporting 3-6% gains over baselines.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

cs.CV · 2026-06-06 · unverdicted · novelty 6.0

DyCo-RL improves four RLVR algorithms on seven visual and math reasoning benchmarks by assigning tokens visual or text roles via Fisher-Rao geodesic distance on attention and reweighting advantages by role-alignment score.

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

Decomposes VLM distillation loss into orthogonal language and visual components and introduces Visual Gradient Steering to prioritize visual grounding over standard monolithic optimization.

Reinforcing Multimodal Reasoning Against Visual Degradation

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Languate Models

cs.CV · 2026-05-05 · unverdicted · novelty 6.0

MHPR is a multidimensional benchmark for LVLM human-centric perception-reasoning with C-RD, SFT-D, RL-D, T-D data tiers and ACVG pipeline, showing training gains on Qwen2.5-VL-7B to near-parity with larger models.

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

cs.CV · 2026-04-12 · unverdicted · novelty 6.0 · 3 refs

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

cs.LG · 2026-02-20 · conditional · novelty 6.0 · 2 refs

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

cs.CV · 2026-02-07 · unverdicted · novelty 6.0

Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning

cs.CV · 2026-06-26 · unverdicted · novelty 5.0

RSICCLLM introduces a post-training framework with RSICI dataset, difference-aware supervised fine-tuning, and dual-negative preference optimization that claims to outperform much larger models on remote sensing image change captioning.

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

cs.CV · 2026-06-24 · unverdicted · novelty 5.0

VIGIL is a counterfactual RL alignment method that reduces visual hallucinations in MLLMs by enforcing visual grounding via masked attention penalties, outperforming baselines with 25% of the data and showing emergent spatial capabilities.

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

cs.CV · 2026-06-24 · unverdicted · novelty 5.0

V-Zero trains MLLMs for visual reasoning without answer labels by gating on-policy distillation trajectories using contrastive evidence from relevant versus negative image crops.

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

cs.AI · 2026-06-22 · unverdicted · novelty 5.0

VeriEvol decouples prompt difficulty evolution from answer reliability verification to scale verified data for visual math reasoning, lifting benchmark accuracy from 35.42 to 54.73 and adding +3.88 in GRPO RL.

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

cs.AI · 2026-06-16 · unverdicted · novelty 5.0

MathVis-Fine proposes a dataset with fine-grained visual annotations and dependency ratings plus a progressive two-stage training paradigm to align visual supervision with sample-specific necessity in multimodal mathematical reasoning.

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

cs.CL · 2026-05-26 · unverdicted · novelty 5.0

MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.

Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual refinement in image generation.

Rethinking Agentic Reinforcement Learning In Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 3.0 · 3 refs

The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Reinforcing Multimodal Reasoning Against Visual Degradation cs.CV · 2026-05-10 · unverdicted · none · ref 35 · internal anchor
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning cs.CV · 2026-04-12 · unverdicted · none · ref 65 · 3 links · internal anchor
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement cs.CV · 2026-04-18 · unverdicted · none · ref 44 · internal anchor
Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual refinement in image generation.

Perception-Aware Policy Optimization for Multimodal Reasoning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer