Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

Dingming Li; Haiwen Hong; Hanrong Zhang; Haolei Xu; Hongxing Li; Hui Xue; Jun Xiao; Longtao Huang; Weiming Lu; Wenjing Jiang

arxiv: 2607.01191 · v1 · pith:QFFALE5Lnew · submitted 2026-07-01 · 💻 cs.CV

Hongxing Li, Xiufeng Huang, Dingming Li, Wenjing Jiang, Zixuan Wang, Haolei Xu, Hanrong Zhang, Haiwen Hong, Longtao Huang, Hui Xue, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Pith reviewed 2026-07-02 13:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords fine-grained visual reasoning · perception-reasoning decoupling · vision-language models · role-aware reinforcement learning · high-resolution image understanding · two-stage reasoning process

The pith

Explicitly separating perception from reasoning in vision-language models raises accuracy on tasks needing small visual details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fine-grained visual reasoning improves when models are trained to first act as a Perceiver that identifies and annotates question-relevant image regions, then switch to a Reasoner that answers using the marked image and crops. This separation is enforced during training by alternating perception-focused and reasoning-focused reinforcement learning updates that rely only on the final answer signal. A sympathetic reader would care because existing models often fail when critical cues are small and buried in high-resolution images, and the claim is that an explicit two-stage process aligns training better than joint approaches or test-time cropping.

Core claim

The central claim is that fine-grained visual reasoning is best formulated as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, then answers the question as a Reasoner based on the annotated image and cropped regions. This formulation is trained with Perception-Reasoning Alternating GRPO (PRA-GRPO), which alternates role-specific updates using only final-answer supervision. Applied to Qwen3-VL backbones, it produces consistent gains: the 4B variant reaches 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, and broader multimodal tasks also benefit.

What carries the argument

The Perceive-to-Reason (P2R) framework, which enforces an explicit two-stage sequence of perception followed by reasoning together with role-aware alternating updates during reinforcement learning.
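The two-stage sequence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the image is a toy nested list, and `perceiver` and `reasoner` are hypothetical callables standing in for the two roles of the same model. The names `ReasonerInput`, `clamp_box`, `crop`, and `perceive_then_reason` are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive
Image = List[List[int]]          # toy grayscale image: rows of pixel values


@dataclass
class ReasonerInput:
    """Everything the Reasoner role sees in stage two (hypothetical container)."""
    question: str
    image: Image        # full image; a real system would draw the boxes onto it
    boxes: List[Box]    # regions marked by the Perceiver
    crops: List[Image]  # one crop per box


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds so cropping stays in range."""
    x0, y0, x1, y1 = box
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    return (x0, y0, x1, y1)


def crop(image: Image, box: Box) -> Image:
    """Cut the boxed region out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def perceive_then_reason(
    image: Image,
    question: str,
    perceiver: Callable[[Image, str], List[Box]],
    reasoner: Callable[[ReasonerInput], str],
) -> str:
    """Stage 1: localize evidence. Stage 2: answer from the image plus crops."""
    height, width = len(image), len(image[0])
    boxes = [clamp_box(b, width, height) for b in perceiver(image, question)]
    boxes = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]  # drop empty boxes
    crops = [crop(image, b) for b in boxes]
    return reasoner(ReasonerInput(question, image, boxes, crops))


# Usage with stub roles: the Perceiver proposes one box that overflows the image,
# and the Reasoner only checks whether any crop contains a bright pixel.
image = [[0] * 8 for _ in range(8)]
image[5][6] = 255  # a small detail in an otherwise empty image
stub_perceiver = lambda img, q: [(5, 4, 12, 7)]
stub_reasoner = lambda inp: (
    "bright" if any(255 in row for c in inp.crops for row in c) else "dark"
)
answer = perceive_then_reason(image, "Is there a bright spot?", stub_perceiver, stub_reasoner)
```

The point of the sketch is the interface: the Reasoner never searches the image itself, it only consumes what the Perceiver marked.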

If this is right

The same decoupling produces measurable gains across model sizes from 2B to 8B parameters on high-resolution benchmarks.
Benefits appear on both high-resolution detail tasks and wider multimodal reasoning benchmarks.
Training requires only final-answer labels yet still aligns the model to separate perception and reasoning behaviors.
The approach removes the need for repeated cropping or test-time visual search at inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation could be used to train perception and reasoning modules independently in larger systems.
Similar staged training might reduce entanglement in other sequential multimodal tasks such as video or document reasoning.
If the pattern holds, single-stage joint training may be leaving performance on the table whenever perception and reasoning have different optimal update frequencies.

Load-bearing premise

The measured gains come from the explicit two-stage decoupling and alternating role updates rather than from extra training compute, data choices, or the reinforcement learning algorithm itself.

What would settle it

The claim would be undermined if a model trained with the same total reinforcement learning steps and data, but without the two-stage formulation or without alternating between perception and reasoning roles, matched or exceeded P2R performance on V-Star and the HR-Bench suites.
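One way to lay out such a test is a small grid of runs that share every budget and differ only in the two mechanisms under question. The sketch below is an assumption about how the comparison could be organized, not a description of the paper's experiments. `RunConfig`, `ablation_grid`, the step count, and the string identifiers are all hypothetical placeholders.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class RunConfig:
    """One training run; only the first two fields vary across the grid."""
    two_stage: bool      # explicit Perceiver -> Reasoner formulation
    alternating: bool    # role-aware alternating updates
    total_rl_steps: int  # held fixed across all runs
    data_mixture: str    # held fixed across all runs
    backbone: str        # held fixed across all runs


def ablation_grid(total_rl_steps: int, data_mixture: str, backbone: str) -> List[RunConfig]:
    """Enumerate the runs needed to separate the two mechanisms."""
    configs = []
    for two_stage, alternating in product([True, False], repeat=2):
        if alternating and not two_stage:
            continue  # alternating between roles needs two roles to exist
        configs.append(
            RunConfig(two_stage, alternating, total_rl_steps, data_mixture, backbone)
        )
    return configs


# Placeholder budget and identifiers, chosen only for illustration.
grid = ablation_grid(total_rl_steps=1000, data_mixture="same-mixture", backbone="4b-backbone")
```

The grid has three runs: the full method, two-stage with joint (non-alternating) updates, and a single-stage baseline. If the last two match the first, the decoupling is not what carries the gains.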

Figures

Figures reproduced from arXiv: 2607.01191 by Dingming Li, Haiwen Hong, Hanrong Zhang, Haolei Xu, Hongxing Li, Hui Xue, Jun Xiao, Longtao Huang, Weiming Lu, Wenjing Jiang, Xiufeng Huang, Yongliang Shen, Yueting Zhuang, Zixuan Wang.

**Figure 2.** Overview of P2R. (a) Illustration of the proposed two-stage P2R inference pipeline. (b) Performance …

**Figure 3.** Overview of PRA-GRPO. Training alternates between a perception phase and a reasoning phase under …

**Figure 5.** Ablation of PRA-GRPO training components

**Figure 6.** Training dynamics of PRA-GRPO during the …

**Figure 7.** Representative examples from the V-Star benchmark, comparing Qwen3-VL-4B and P2R-4B.

**Figure 8.** Performance over three iterations on MME …

**Figure 9.** Figure 9 shows the evaluation accuracy on V-Star, HR-Bench-4K, and HR-Bench-8K across the Perceiver and Reasoner training phases. We observe consistent gains on all three benchmarks, with average performance steadily improving throughout training. This result suggests that both stages of PRA-GRPO contribute to better fine-grained visual perception on challenging high-resolution images. In particular, the gains continue not only dur… (Panel titles: Perceiver Training Phase, Reasoner Training Phase.)

**Figure 10.** Dynamics of bounding box count and size on V-Star during the Perceiver and Reasoner training phases of PRA-GRPO.

**Figure 11.** Comparison of inference throughput (sam…

**Figure 12.** Comparison of different bounding box inputs

**Figure 13.** A successful case of P2R-4B on the V-Star Attribute.

**Figure 14.** A successful case of P2R-4B on the V-Star Spatial.

**Figure 15.** A successful case of P2R-4B on the HR-Bench FSP.

**Figure 16.** A successful case of P2R-4B on the MME-RealWorld-Lite Perception Remote Sensing.

**Figure 17.** A successful case of P2R-4B on the MME-RealWorld-Lite Reasoning Diagram and Table.

**Figure 18.** A successful case of P2R-4B on the MME-RealWorld-Lite Reasoning OCR with Complex Context.

**Figure 19.** A failure case of P2R-4B on the HR-Bench FSP.

**Figure 20.** A failure case of P2R-4B on the MME-RealWorld-Lite Perception Remote Sensing.

Original abstract

Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
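The abstract does not spell out the update rule, so the following is only a sketch of how role-aware alternation could sit on top of a GRPO-style advantage. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO ingredient. Everything else is an assumption made for illustration: the fixed-length phase schedule, the per-token role labels, and the function names `group_relative_advantages`, `phase_for_step`, and `role_mask`.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize final-answer rewards within one group of rollouts.

    Uses the population standard deviation; implementations differ on this choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_for_step(step: int, steps_per_phase: int) -> str:
    """Assumed schedule: blocks of perception steps alternate with reasoning steps."""
    return "perception" if (step // steps_per_phase) % 2 == 0 else "reasoning"


def role_mask(token_roles: List[str], phase: str) -> List[int]:
    """Assumed role-aware masking: only the active role's tokens get a gradient."""
    active = "perceiver" if phase == "perception" else "reasoner"
    return [1 if role == active else 0 for role in token_roles]


# Four rollouts for one question, scored only on final-answer correctness.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# One rollout's tokens, labeled by the role that produced them.
token_roles = ["perceiver", "perceiver", "reasoner", "reasoner", "reasoner"]
mask = role_mask(token_roles, phase_for_step(step=2, steps_per_phase=2))
```

In a perception phase the mask would be flipped, so the same final-answer advantage is credited to the box-producing tokens instead of the answer tokens.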

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

P2R adds an explicit Perceiver-Reasoner split and alternating PRA-GRPO updates, but the reported gains lack controls that isolate the decoupling from the rest of the training recipe.

The main thing here is the clean two-stage split where the model first localizes evidence as a Perceiver then answers as a Reasoner, trained via role-aware alternating updates in PRA-GRPO. They apply this to Qwen3-VL backbones and report clear lifts on V-Star and the HR-Bench sets, with the 4B version at 93.2%, 81.9%, and 80.5% respectively.

What is actually new is the explicit decoupling plus the alternating schedule that focuses updates on one role at a time while using only final-answer supervision. This differs from the repeated cropping or test-time search baselines cited in the abstract, and the modularity looks practical for high-resolution cases where small cues matter.

The numbers across model scales and the extension to other multimodal tasks are the concrete evidence offered. The formulation is straightforward and the training alignment with the two stages is a reasonable design choice.

The soft spot is the load-bearing premise flagged above. No experiment holds the underlying RL algorithm, data mixture, and total steps fixed while removing the Perceiver/Reasoner split or the alternating schedule. That leaves open whether the gains come from the claimed decoupling or from generic effects of the GRPO variant and extra training. The abstract also skips baseline details and statistics, so the full paper needs to show those controls and comparisons to make the attribution stick.

This is for people working on fine-grained visual reasoning or modular VLM training. A reader who wants a concrete recipe to test on other backbones would get something usable from it. It deserves peer review because the idea is clear, the empirical deltas are large enough to check, and the gaps are fixable with targeted ablations.

Referee Report

2 major / 2 minor

Summary. The paper proposes Perceive-to-Reason (P2R), a two-stage framework for fine-grained visual reasoning in VLMs: a Perceiver stage localizes question-relevant evidence in high-resolution images, followed by a Reasoner stage that answers using the annotated image and crops. Training uses Perception-Reasoning Alternating GRPO (PRA-GRPO) with role-aware alternating updates under final-answer supervision only. Built on Qwen3-VL-Instruct backbones (2B/4B/8B), P2R reports consistent gains, including P2R-4B at 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, plus benefits on broader multimodal tasks.

Significance. If the gains can be causally attributed to the explicit perception-reasoning decoupling and alternating updates, the framework offers a practical training recipe that could improve fine-grained reasoning without test-time search or repeated cropping. The approach is notable for using only final-answer supervision and for scaling across model sizes, but its impact depends on whether the reported deltas exceed what generic RL or data choices would produce.

major comments (2)

[Experiments / Results] The central claim—that explicit two-stage decoupling plus role-aware alternating updates drive the reported gains—is not supported by any ablation that holds the underlying RL algorithm (GRPO variant), data mixture, and total training steps fixed while removing either the Perceiver/Reasoner architectural split or the alternating update schedule. Without this control, the deltas (e.g., the 93.2%/81.9%/80.5% numbers for the 4B model) cannot be attributed to the proposed mechanism rather than to other training factors.
[Abstract and §4 (Experiments)] The abstract and method description provide no information on baseline implementations, statistical significance testing, data splits, or variance across runs for the three main benchmarks. This makes it impossible to assess whether the claimed improvements over the corresponding backbones are robust or reproducible.

minor comments (2)

[Method] Notation for the Perceiver and Reasoner roles and the exact form of the alternating update rule should be formalized with equations or pseudocode for clarity.
[Experiments] The paper should clarify whether the reported numbers use the same evaluation protocol and prompts as the backbone baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the attribution of results to the proposed mechanisms and improve reproducibility details.

Referee: [Experiments / Results] The central claim—that explicit two-stage decoupling plus role-aware alternating updates drive the reported gains—is not supported by any ablation that holds the underlying RL algorithm (GRPO variant), data mixture, and total training steps fixed while removing either the Perceiver/Reasoner architectural split or the alternating update schedule. Without this control, the deltas (e.g., the 93.2%/81.9%/80.5% numbers for the 4B model) cannot be attributed to the proposed mechanism rather than to other training factors.

Authors: We agree that the manuscript currently lacks a fully controlled ablation that isolates the Perceiver/Reasoner split and alternating schedule while exactly matching GRPO variant, data mixture, and total steps. In the revision we will add these experiments (e.g., a single-stage GRPO baseline and a non-alternating joint-update variant) on the same 4B backbone and report the resulting deltas on V-Star, HR-Bench-4K, and HR-Bench-8K. This will allow readers to attribute gains more directly to the decoupling and PRA-GRPO schedule.

Revision: yes

Referee: [Abstract and §4 (Experiments)] The abstract and method description provide no information on baseline implementations, statistical significance testing, data splits, or variance across runs for the three main benchmarks. This makes it impossible to assess whether the claimed improvements over the corresponding backbones are robust or reproducible.

Authors: We acknowledge the omission. In the revised §4 we will (1) detail how the Qwen3-VL-Instruct baselines were evaluated (prompting, resolution handling, and decoding settings), (2) specify the exact training/evaluation data splits and any filtering applied, (3) report standard deviation across at least three independent runs for the primary benchmarks, and (4) include a brief note on statistical significance where sample sizes permit. These additions will be reflected in an updated abstract if space allows.

Revision: yes
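The per-benchmark reporting promised in item (3) amounts to a mean and a sample standard deviation over independent runs. A minimal sketch follows, assuming scores are collected per benchmark as plain lists. `summarize_runs` is a hypothetical helper, and the numbers in the usage example are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(scores_by_benchmark: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return (mean, sample standard deviation) of accuracy for each benchmark."""
    summary = {}
    for name, scores in scores_by_benchmark.items():
        if len(scores) < 2:
            # A sample standard deviation needs at least two runs.
            raise ValueError(f"{name}: need at least two runs, got {len(scores)}")
        summary[name] = (mean(scores), stdev(scores))
    return summary


# Placeholder accuracies for three hypothetical runs; NOT results from the paper.
summary = summarize_runs({"benchmark-a": [90.0, 92.0, 91.0]})
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.1f} ± {sd:.1f}")
```

With only three runs the standard deviation is itself noisy, so it is best read as a rough indicator of run-to-run spread rather than a significance test.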

Circularity Check

0 steps flagged

No circularity; purely empirical framework with no derivations

The paper introduces an empirical two-stage framework (Perceiver then Reasoner) plus PRA-GRPO training and reports benchmark scores on V-Star, HR-Bench, etc. No equations, fitted parameters, or mathematical derivations appear anywhere in the provided text. The performance claims are direct experimental outcomes rather than quantities derived from or equivalent to the inputs by construction. Self-citations are not load-bearing for any derivation because none exists. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from reinforcement learning applied to VLMs plus one domain-specific assumption about the effectiveness of role alternation; no new physical entities or free parameters beyond model scale are introduced.

axioms (1)

Domain assumption: Alternating perception-focused and reasoning-focused GRPO updates using only final-answer supervision suffices to train the decoupled Perceiver and Reasoner roles.
This premise is required for the PRA-GRPO method to produce the claimed separation of concerns.

pith-pipeline@v0.9.1-grok · 5840 in / 1326 out tokens · 33693 ms · 2026-07-02T13:22:40.070206+00:00 · methodology
