pith. machine review for the scientific record.

arxiv: 2505.14362 · v3 · submitted 2025-05-20 · 💻 cs.CV

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Pith reviewed 2026-05-11 14:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · reinforcement learning · active perception · multimodal reasoning · visual grounding · thinking with images · hallucination reduction

The pith

Reinforcement learning lets vision-language models develop native image-based reasoning without pre-collected data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a vision-language model can acquire the capacity to think with images by using reinforcement learning to foster active perception. This process relies on the model's own grounding abilities and a custom data selection plus reward design rather than any initial supervised fine-tuning on reasoning examples. A sympathetic reader would care because the resulting behavior produces measurable gains on perception and reasoning tasks while also cutting hallucinations and aiding mathematical work. The training trace reveals the model shifting from broad visual exploration toward precise, efficient exploitation of image information. In short, the claim is that image-grounded reasoning can emerge as an intrinsic, reward-shaped skill instead of an externally supplied one.

Core claim

DeepEyes trains a vision-language model end-to-end with reinforcement learning so that it learns to think with images through active perception, using its intrinsic grounding capability rather than external tools or pre-collected reasoning data. A tailored data selection and reward strategy steers the model to strategically ground its reasoning in visual content. The outcome is significant gains on general perception and reasoning benchmarks together with better grounding, lower hallucination rates, and stronger mathematical reasoning. During training the model passes through distinct stages: initial exploratory perception gives way to efficient and accurate exploitation, accompanied by diverse thinking patterns that closely mirror human visual reasoning.

What carries the argument

Active perception, the learned strategy by which the model decides when and how to ground its ongoing reasoning directly in visual information.

If this is right

  • Performance improves on perception and reasoning benchmarks without any pre-collected reasoning traces.
  • Grounding accuracy rises while hallucination rates fall, including on mathematical reasoning tasks.
  • The model exhibits an internal progression from exploratory to exploitative visual behavior.
  • Diverse thinking patterns appear that parallel human visual reasoning sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement-learning incentive structure could be tested on video or audio sequences to induce analogous active-perception loops.
  • If the approach scales, training pipelines for multimodal models may require far less curated reasoning data than current supervised routes.
  • Longer-horizon tasks could reveal whether the emergent perception strategies remain stable or require additional reward shaping.
  • Real-world deployment in dynamic environments would test whether the learned visual-grounding habits transfer beyond static benchmark images.

Load-bearing premise

The custom reward and data selection rules will steer the model toward genuine, useful visual grounding rather than superficial patterns that merely maximize the reward signal.

What would settle it

Run the same reinforcement learning loop with the visual-grounding reward terms removed or replaced by generic accuracy rewards; if benchmark gains and the reported evolution of perception behavior remain unchanged, the claim that active perception drives the improvements is falsified.
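One way to make "remain unchanged" operational is a paired comparison of per-item correctness between the full-reward run and the ablated run. The sketch below is illustrative only: the function names, the percentile bootstrap, and the 95% level are assumptions chosen for this example, not a procedure taken from the paper.

```python
import random

def paired_bootstrap_ci(full, ablated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(full - ablated) over paired per-item scores.

    `full` and `ablated` are hypothetical lists of 0/1 correctness values,
    one entry per benchmark item, in the same item order.
    """
    if len(full) != len(ablated) or not full:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    diffs = [f - a for f, a in zip(full, ablated)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

def no_detectable_difference(full, ablated, **kwargs):
    """True when the interval for the accuracy gap contains zero."""
    _, lo, hi = paired_bootstrap_ci(full, ablated, **kwargs)
    return lo <= 0.0 <= hi
```

An interval that contains zero only means the ablation produced no detectable difference at this sample size; it is weaker than a demonstration of equivalence, so a real test would also report the interval width.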

Original abstract

Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning without requiring pre-collected reasoning data for cold-start supervised fine-tuning (SFT). Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DeepEyes, a vision-language model trained end-to-end via reinforcement learning to develop native 'thinking with images' capability through active perception. It claims this emerges without any cold-start supervised fine-tuning on pre-collected reasoning data, relying instead on tailored data selection and a custom reward strategy that leverages the model's intrinsic grounding. The approach reportedly yields significant gains on general perception and reasoning benchmarks, plus improvements in grounding, hallucination reduction, and mathematical reasoning, with observed behavioral evolution from exploration to exploitation and diverse human-like thinking patterns.

Significance. If the central claims hold under rigorous verification, the work would be moderately significant for multimodal AI research. It offers an empirical demonstration that RL can elicit integrated visual reasoning in VLMs without heavy reliance on SFT or external tools, potentially reducing data curation costs and enabling more autonomous active perception. The public code release is a clear strength for reproducibility.

major comments (3)
  1. [Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.
  2. [Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.
  3. [Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.
minor comments (2)
  1. [Abstract] The abstract and introduction use the phrase 'significant performance gains' without defining the term or providing supporting numbers; this should be replaced with concrete metrics or removed.
  2. [Methods] Notation for the active perception loop (e.g., how visual grounding actions are interleaved with text reasoning) is introduced informally; a clear algorithmic pseudocode or diagram would improve clarity.
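To illustrate the kind of pseudocode the second minor comment asks for, here is a minimal sketch of a loop that interleaves text reasoning with grounding actions. Everything in it is an assumption made for illustration: the `Step` and `Trace` types, the crop-by-bounding-box action, the step budget, and the toy policy are hypothetical and are not taken from the paper or its code release.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]           # toy grayscale image as rows of pixels
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), half-open pixel coordinates

@dataclass
class Step:
    thought: str                   # text reasoning emitted at this step
    box: Optional[Box] = None      # set when the policy asks to inspect a region
    answer: Optional[str] = None   # set when the policy commits to an answer

@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    views: List[Image] = field(default_factory=list)  # images visible so far

def crop(image: Image, box: Box) -> Image:
    """Clamp the box to the image bounds and return the selected region."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("empty crop")
    return [row[x0:x1] for row in image[y0:y1]]

def run_episode(image: Image, policy: Callable[[Trace], Step], max_steps: int = 4) -> Trace:
    """Alternate reasoning and perception until an answer or the step budget."""
    trace = Trace(views=[image])
    for _ in range(max_steps):
        step = policy(trace)
        trace.steps.append(step)
        if step.answer is not None:
            break
        if step.box is not None:
            # The requested region becomes a new observation for later steps.
            trace.views.append(crop(image, step.box))
    return trace

def toy_policy(trace: Trace) -> Step:
    """Stand-in for a model: look once at the top-left corner, then answer."""
    if len(trace.views) == 1:
        return Step("The detail is small; inspect the top-left corner.", box=(0, 0, 2, 2))
    return Step("The crop is clear enough.", answer=str(trace.views[-1][0][0]))
```

In this sketch the policy sees the whole trace, so later reasoning steps can condition on earlier crops; whether the paper's loop feeds observations back in this way is something the requested pseudocode would need to state.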

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We appreciate the opportunity to clarify the presentation of our results, methods, and analyses. We address each major comment below and commit to revisions that will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.

    Authors: We agree that explicit quantitative comparisons are necessary to substantiate the claims. In the revised manuscript we will add tables reporting baseline scores, absolute and relative performance deltas, error bars from multiple runs, and statistical significance tests. We will also include a discussion comparing the observed gains against what data curation alone can achieve, thereby clarifying the contribution of the RL objective. revision: yes

  2. Referee: [Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.

    Authors: We acknowledge that the reward formulation was presented at too high a level. The revised Methods section will contain the complete mathematical definition of the reward, explicitly listing each component (grounding accuracy, reasoning utility, format compliance) together with the weighting coefficients. This will enable readers to evaluate convergence behavior and rule out superficial reward hacking. revision: yes

  3. Referee: [Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.

    Authors: We agree that additional ablations and quantitative tracking would strengthen the mechanistic claims. The revision will include ablation experiments that compare full RL training against data-selection-only baselines, as well as plots of grounding utility and exploration/exploitation metrics across training steps. Full causal interventions (forcing or ablating visual thought steps) would require new controlled runs; we will therefore provide enhanced observational analysis and discuss the limits of the current evidence. revision: partial
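The second response promises a reward with explicit components and weights. The sketch below shows one shape such a formulation could take, a weighted sum of the three components the referee names. The weights, the use of box IoU as a grounding score, and all function names are hypothetical placeholders, not the paper's reward.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates

@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical coefficients chosen only so the example sums to 1.
    grounding: float = 0.3
    reasoning: float = 0.6
    format: float = 0.1

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def composite_reward(grounding_acc: float, reasoning_utility: float,
                     format_ok: bool, w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of grounding accuracy, reasoning utility, and format compliance."""
    for name, value in (("grounding_acc", grounding_acc),
                        ("reasoning_utility", reasoning_utility)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (w.grounding * grounding_acc
            + w.reasoning * reasoning_utility
            + w.format * (1.0 if format_ok else 0.0))
```

With the components written out like this, the reward-hacking concern becomes checkable: under these placeholder weights a trajectory that only satisfies the format term earns `composite_reward(0.0, 0.0, True)`, which equals the format weight, so a reader can see how much reward a superficial pattern could collect.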

Circularity Check

0 steps flagged

No circularity: empirical RL training with external benchmarks

Full rationale

The paper presents an empirical end-to-end RL method for training VLMs to perform active perception and 'think with images' without cold-start SFT. Claims rest on performance gains measured against external perception/reasoning benchmarks and observed behavioral evolution during training. No mathematical derivations, equations, or self-referential definitions are present that would reduce any result to its inputs by construction. The approach is self-contained against independent evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard RL assumptions and the pre-existing grounding capability of the base VLM; no new physical entities or ad-hoc constants are introduced.

pith-pipeline@v0.9.0 · 5496 in / 1029 out tokens · 41197 ms · 2026-05-11T14:37:27.553701+00:00 · methodology

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...

  3. UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.

  4. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  5. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...

  6. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  7. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  8. Act2See: Emergent Active Visual Perception for Video Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

  9. Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.

  10. TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

    cs.AI 2026-04 conditional novelty 7.0

    TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.

  11. OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.

  12. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  13. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  14. Venus-DeFakerOne: Unified Fake Image Detection & Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.

  15. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  16. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  17. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...

  18. Beyond Thinking: Imagining in 360° for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  19. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  20. AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.

  21. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 6.0

    Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

  22. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  23. SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

  24. Visual Reasoning through Tool-supervised Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

  25. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  26. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  27. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  28. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  29. AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

  30. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  31. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  32. LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

  33. CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    cs.AI 2026-04 unverdicted novelty 6.0

    CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

  34. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

  35. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  36. Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.

  37. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 5.0

    Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.

  38. SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

    cs.IR 2026-04 unverdicted novelty 5.0

    SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.

  39. Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

    cs.CV 2026-04 unverdicted novelty 5.0

    Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...

  40. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  41. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  42. Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

    cs.CV 2026-04 unverdicted novelty 5.0

    TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolutio...

  43. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  44. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  45. MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

  46. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  47. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 42 Pith papers · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

  4. [4]

    Shapiro, and Ranjay Krishna

    Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. arXiv preprint arXiv:2412.03548,

  5. [5]

    Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

    Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom supercharges large visual-language model.arXiv e-prints, pp. arXiv–2406, 2024a. Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasonin...

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    10 Published as a conference paper at ICLR 2026 Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024b. Zhe Chen, Jiannan Wu, Wenhai Wang, ...

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jia...

  8. [8]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798,

  9. [9]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli ´c, and Furu Wei. Imagine while reasoning in space: Multimodal visua...

  10. [10]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,

  11. [11]

    arXiv preprint arXiv:2503.06520 (2025)

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowled...

  12. [12]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255,

  13. [13]

    Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving

    Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, and Junnyong Loo. Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving. arXiv preprint arXiv:2412.02025,

  14. [14]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365,

  15. [15]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393,

  16. [16]

    Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536,

  17. [17]

    We-math: Does your large multimodal model achieve human-like mathematical reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284,

  18. [18]

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song...

  19. [19]

    Mome: Mixture of multimodal experts for generalist multimodal large language models

    Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, and Liqiang Nie. Mome: Mixture of multimodal experts for generalist multimodal large language models. arXiv preprint arXiv:2407.12709, 2024b. Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Lei Zhang, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, et al. Llava-mod: Making llava tiny via moe knowl...

  20. [20]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966,

  21. [21]

    Visual agents as fast and slow thinkers

    Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers. arXiv preprint arXiv:2408.08862,

  22. [22]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025a. Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi...

  23. [23]

    Logicvista: Multimodal llm logical reasoning benchmark in visual contexts

    Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973,

  24. [24]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528,

  25. [25]

    Show-o turbo: Towards accelerated unified multimodal understanding and generation

    Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, and Zhijie Deng. Show-o turbo: Towards accelerated unified multimodal understanding and generation. arXiv preprint arXiv:2502.05415,

  26. [26]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

  27. [27]

    The dawn of lmms: Preliminary explorations with gpt-4v (ision)

    Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1):1,

  28. [28]

    mplug-owl3: Towards long image-sequence understanding in multi-modal large language models

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024a. Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. ...

  29. [29]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13040–13051, 2024b. Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang,...

  30. [30]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479,

  31. [31]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836,

  32. [32]

    A.1 System Prompt

    A PROMPT. A.1 SYSTEM PROMPT. SYSTEM_PROMPT: You are a helpful assistant. # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> { "type": "function", "function": { "name": "image_zoom_in_tool", "description": "Zoom in on a specific region of an image by cro...
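The system prompt above names a tool, `image_zoom_in_tool`, that zooms in by cropping a region of the image, but its parameter list is cut off in this excerpt. The following is only a minimal sketch of what such a crop could look like, not the authors' implementation. The function name `zoom_in`, the `bbox = (x1, y1, x2, y2)` pixel-coordinate argument, the clamping behaviour, and the nested-list image representation are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical stand-in for an image: rows of grayscale pixel values.
Image = List[List[int]]


def zoom_in(image: Image, bbox: Sequence[int]) -> Image:
    """Crop `image` to bbox = (x1, y1, x2, y2), clamped to the image bounds.

    The bbox format is an assumption; the real tool signature is truncated
    in the source text.
    """
    if not image or not image[0]:
        raise ValueError("image must be non-empty")
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = bbox
    # Clamp the box so a slightly out-of-range request still yields a crop.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x1 >= x2 or y1 >= y2:
        raise ValueError("bbox has no overlap with the image")
    # Slice rows first, then columns within each kept row.
    return [row[x1:x2] for row in image[y1:y2]]


# A 3x4 toy image whose pixel value encodes (row, column) as 10 * row + column.
image = [[10 * r + c for c in range(4)] for r in range(3)]
crop = zoom_in(image, (1, 0, 3, 2))
```

In a real pipeline the cropped region would presumably be returned to the model as a new image observation; an image library would replace the list slicing, and the coordinate convention would need to match whatever the actual tool schema specifies.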