Active-o3: Empowering multimodal large language models with active perception via grpo

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu · 2025 · arXiv 2505.21457

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

Boosting Reasoning in Large Multimodal Models via Activation Replay

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.

Perception-Aware Policy Optimization for Multimodal Reasoning

cs.CL · 2025-07-08 · unverdicted · novelty 6.0

PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

cs.AI · 2026-05-15 · unverdicted · novelty 5.0

DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

cs.CV · 2025-09-09 · unverdicted · novelty 5.0

Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

citing papers explorer

Showing 6 of 6 citing papers.

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 66
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Boosting Reasoning in Large Multimodal Models via Activation Replay cs.CV · 2025-11-25 · unverdicted · none · ref 67
Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
Perception-Aware Policy Optimization for Multimodal Reasoning cs.CL · 2025-07-08 · unverdicted · none · ref 45
PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.
CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision cs.CV · 2026-05-19 · unverdicted · none · ref 6
Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding cs.AI · 2026-05-15 · unverdicted · none · ref 55
DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search cs.CV · 2025-09-09 · unverdicted · none · ref 51
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

Active-o3: Empowering multimodal large language models with active perception via grpo

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer