pith. sign in

arxiv: 2607.01191 · v1 · pith:QFFALE5Lnew · submitted 2026-07-01 · 💻 cs.CV

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

Pith reviewed 2026-07-02 13:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained visual reasoningperception-reasoning decouplingvision-language modelsrole-aware reinforcement learninghigh-resolution image understandingtwo-stage reasoning process
0
0 comments X

The pith

Explicitly separating perception from reasoning in vision-language models raises accuracy on tasks needing small visual details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fine-grained visual reasoning improves when models are trained to first act as a Perceiver that identifies and annotates question-relevant image regions, then switch to a Reasoner that answers using the marked image and crops. This separation is enforced during training by alternating perception-focused and reasoning-focused reinforcement learning updates that rely only on the final answer signal. A sympathetic reader would care because existing models often fail when critical cues are small and buried in high-resolution images, and the claim is that an explicit two-stage process aligns training better than joint approaches or test-time cropping.

Core claim

The central claim is that fine-grained visual reasoning is best formulated as a two-stage process in which the model first localizes question-relevant evidence as a Perceiver and then answers the question as a Reasoner based on the annotated image and cropped regions; training this formulation with Perception-Reasoning Alternating GRPO, which alternates role-specific updates using only final-answer supervision, produces consistent gains when applied to Qwen3-VL backbones, reaching 93.2 percent on V-Star, 81.9 percent on HR-Bench-4K, and 80.5 percent on HR-Bench-8K for the 4B variant while also helping broader multimodal tasks.

What carries the argument

The Perceive-to-Reason (P2R) framework, which enforces an explicit two-stage sequence of perception followed by reasoning together with role-aware alternating updates during reinforcement learning.

If this is right

  • The same decoupling produces measurable gains across model sizes from 2B to 8B parameters on high-resolution benchmarks.
  • Benefits appear on both high-resolution detail tasks and wider multimodal reasoning benchmarks.
  • Training requires only final-answer labels yet still aligns the model to separate perception and reasoning behaviors.
  • The approach removes the need for repeated cropping or test-time visual search at inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation could be used to train perception and reasoning modules independently in larger systems.
  • Similar staged training might reduce entanglement in other sequential multimodal tasks such as video or document reasoning.
  • If the pattern holds, single-stage joint training may be leaving performance on the table whenever perception and reasoning have different optimal update frequencies.

Load-bearing premise

The measured gains come from the explicit two-stage decoupling and alternating role updates rather than from extra training compute, data choices, or the reinforcement learning algorithm itself.

What would settle it

A model trained with the same total reinforcement learning steps and data but without the two-stage formulation or without alternating between perception and reasoning roles would match or exceed P2R performance on V-Star and the HR-Bench suites.

Figures

Figures reproduced from arXiv: 2607.01191 by Dingming Li, Haiwen Hong, Hanrong Zhang, Haolei Xu, Hongxing Li, Hui Xue, Jun Xiao, Longtao Huang, Weiming Lu, Wenjing Jiang, Xiufeng Huang, Yongliang Shen, Yueting Zhuang, Zixuan Wang.

Figure 1
Figure 1. Figure 1: Motivation of P2R. Prior methods inject local [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of P2R. (a) Illustration of the proposed two-stage P2R inference pipeline. (b) Performance [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of PRA-GRPO. Training alternates between a perception phase and a reasoning phase under [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation of PRA-GRPO training components [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training dynamics of PRA-GRPO during the [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representative examples from the V-Star benchmark, comparing Qwen3-VL-4B and P2R-4B. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance over three iterations on MME [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: shows the evaluation accuracy on V-Star, HR-Bench-4K, and HR-Bench-8K across the Per￾ceiver and Reasoner training phases. We observe Perceiver Training Phase Reasoner Training Phase [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Dynamics of bounding box count and size on V-Star during the Perceiver and Reasoner training phases of PRA-GRPO. consistent gains on all three benchmarks, with av￾erage performance steadily improving throughout training. This result suggests that both stages of PRA-GRPO contribute to better fine-grained visual perception on challenging high-resolution images. In particular, the gains continue not only dur… view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of inference throughput (sam [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison of different bounding box inputs [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: A successful case of P2R-4B on the V-Star Attribute. [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: A successful case of P2R-4B on the V-Star Spatial. [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: A successful case of P2R-4B on the HR-Bench FSP. [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: A successful case of P2R-4B on the MME-RealWorld-Lite Perception Remote Sensing. [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: A successful case of P2R-4B on the MME-RealWorld-Lite Reasoning Diagram and Table. [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: A successful case of P2R-4B on the MME-RealWorld-Lite Reasoning OCR with Complex Context. [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: A failure case of P2R-4B on the HR-Bench FSP. [PITH_FULL_IMAGE:figures/full_fig_p023_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: A failure case of P2R-4B on the MME-RealWorld-Lite Perception Remote Sensing. [PITH_FULL_IMAGE:figures/full_fig_p024_20.png] view at source ↗
read the original abstract

Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Perceive-to-Reason (P2R), a two-stage framework for fine-grained visual reasoning in VLMs: a Perceiver stage localizes question-relevant evidence in high-resolution images, followed by a Reasoner stage that answers using the annotated image and crops. Training uses Perception-Reasoning Alternating GRPO (PRA-GRPO) with role-aware alternating updates under final-answer supervision only. Built on Qwen3-VL-Instruct backbones (2B/4B/8B), P2R reports consistent gains, including P2R-4B at 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, plus benefits on broader multimodal tasks.

Significance. If the gains can be causally attributed to the explicit perception-reasoning decoupling and alternating updates, the framework offers a practical training recipe that could improve fine-grained reasoning without test-time search or repeated cropping. The approach is notable for using only final-answer supervision and for scaling across model sizes, but its impact depends on whether the reported deltas exceed what generic RL or data choices would produce.

major comments (2)
  1. [Experiments / Results] The central claim—that explicit two-stage decoupling plus role-aware alternating updates drive the reported gains—is not supported by any ablation that holds the underlying RL algorithm (GRPO variant), data mixture, and total training steps fixed while removing either the Perceiver/Reasoner architectural split or the alternating update schedule. Without this control, the deltas (e.g., the 93.2%/81.9%/80.5% numbers for the 4B model) cannot be attributed to the proposed mechanism rather than to other training factors.
  2. [Abstract and §4 (Experiments)] The abstract and method description provide no information on baseline implementations, statistical significance testing, data splits, or variance across runs for the three main benchmarks. This makes it impossible to assess whether the claimed improvements over the corresponding backbones are robust or reproducible.
minor comments (2)
  1. [Method] Notation for the Perceiver and Reasoner roles and the exact form of the alternating update rule should be formalized with equations or pseudocode for clarity.
  2. [Experiments] The paper should clarify whether the reported numbers use the same evaluation protocol and prompts as the backbone baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the attribution of results to the proposed mechanisms and improve reproducibility details.

read point-by-point responses
  1. Referee: [Experiments / Results] The central claim—that explicit two-stage decoupling plus role-aware alternating updates drive the reported gains—is not supported by any ablation that holds the underlying RL algorithm (GRPO variant), data mixture, and total training steps fixed while removing either the Perceiver/Reasoner architectural split or the alternating update schedule. Without this control, the deltas (e.g., the 93.2%/81.9%/80.5% numbers for the 4B model) cannot be attributed to the proposed mechanism rather than to other training factors.

    Authors: We agree that the manuscript currently lacks a fully controlled ablation that isolates the Perceiver/Reasoner split and alternating schedule while exactly matching GRPO variant, data mixture, and total steps. In the revision we will add these experiments (e.g., a single-stage GRPO baseline and a non-alternating joint-update variant) on the same 4B backbone and report the resulting deltas on V-Star, HR-Bench-4K, and HR-Bench-8K. This will allow readers to attribute gains more directly to the decoupling and PRA-GRPO schedule. revision: yes

  2. Referee: [Abstract and §4 (Experiments)] The abstract and method description provide no information on baseline implementations, statistical significance testing, data splits, or variance across runs for the three main benchmarks. This makes it impossible to assess whether the claimed improvements over the corresponding backbones are robust or reproducible.

    Authors: We acknowledge the omission. In the revised §4 we will (1) detail how the Qwen3-VL-Instruct baselines were evaluated (prompting, resolution handling, and decoding settings), (2) specify the exact training/evaluation data splits and any filtering applied, (3) report standard deviation across at least three independent runs for the primary benchmarks, and (4) include a brief note on statistical significance where sample sizes permit. These additions will be reflected in an updated abstract if space allows. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical framework with no derivations

full rationale

The paper introduces an empirical two-stage framework (Perceiver then Reasoner) plus PRA-GRPO training and reports benchmark scores on V-Star, HR-Bench, etc. No equations, fitted parameters, or mathematical derivations appear anywhere in the provided text. The performance claims are direct experimental outcomes rather than quantities derived from or equivalent to the inputs by construction. Self-citations are not load-bearing for any derivation because none exists. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions from reinforcement learning applied to VLMs plus one domain-specific assumption about the effectiveness of role alternation; no new physical entities or free parameters beyond model scale are introduced.

axioms (1)
  • domain assumption Alternating perception-focused and reasoning-focused GRPO updates using only final-answer supervision suffices to train the decoupled Perceiver and Reasoner roles.
    This premise is required for the PRA-GRPO method to produce the claimed separation of concerns.

pith-pipeline@v0.9.1-grok · 5840 in / 1326 out tokens · 33693 ms · 2026-07-02T13:22:40.070206+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 40 canonical work pages · 25 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    GPT-4o System Card

    Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

  9. [9]

    2025 , url =

    OpenAI , title =. 2025 , url =

  10. [10]

    LLaVA-OneVision: Easy Visual Task Transfer

    Llava-onevision: Easy visual task transfer , author=. arXiv preprint arXiv:2408.03326 , year=

  11. [11]

    Qwen2.5-VL Technical Report

    Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

  12. [12]

    Qwen3-VL Technical Report

    Qwen3-vl technical report , author=. arXiv preprint arXiv:2511.21631 , year=

  13. [13]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    V?: Guided visual search as a core mechanism in multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  14. [14]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  15. [15]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

    Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

  16. [16]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Deepeyes: Incentivizing" thinking with images" via reinforcement learning , author=. arXiv preprint arXiv:2505.14362 , year=

  17. [17]

    DeepEyesV2: Toward Agentic Multimodal Model

    Deepeyesv2: Toward agentic multimodal model , author=. arXiv preprint arXiv:2511.05271 , year=

  18. [18]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning , author=. arXiv preprint arXiv:2505.15966 , year=

  19. [19]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Mini-o3: Scaling up reasoning patterns and interaction turns for visual search , author=. arXiv preprint arXiv:2509.07969 , year=

  20. [20]

    Thyme: Think Beyond Images

    Thyme: Think beyond images , author=. arXiv preprint arXiv:2508.11630 , year=

  21. [21]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Vision-r1: Incentivizing reasoning capability in multimodal large language models , author=. arXiv preprint arXiv:2503.06749 , year=

  22. [22]

    arXiv preprint arXiv:2504.07954 , year=

    Perception-r1: Pioneering perception policy with reinforcement learning , author=. arXiv preprint arXiv:2504.07954 , year=

  23. [23]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  24. [24]

    MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? , author=. arXiv preprint arXiv:2408.13257 , year=

  25. [25]

    arXiv preprint arXiv:2602.11858 , year=

    Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception , author=. arXiv preprint arXiv:2602.11858 , year=

  26. [26]

    Advances in Neural Information Processing Systems , volume=

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning , author=. Advances in Neural Information Processing Systems , volume=

  27. [27]

    arXiv preprint arXiv:2403.12966 , year=

    Chain-of-spot: Interactive reasoning improves large vision-language models , author=. arXiv preprint arXiv:2403.12966 , year=

  28. [28]

    Advances in Neural Information Processing Systems , volume=

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models , author=. Advances in Neural Information Processing Systems , volume=

  29. [29]

    GRIT: Teaching MLLMs to Think with Images

    Grit: Teaching mllms to think with images , author=. arXiv preprint arXiv:2505.15879 , year=

  30. [30]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Visual-rft: Visual reinforcement fine-tuning , author=. arXiv preprint arXiv:2503.01785 , year=

  31. [31]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Vlm-r1: A stable and generalizable r1-style large vision-language model , author=. arXiv preprint arXiv:2504.07615 , year=

  32. [32]

    CoRR , year=

    Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning , author=. CoRR , year=

  33. [33]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization , author=. arXiv preprint arXiv:2503.10615 , year=

  34. [34]

    arXiv preprint arXiv:2602.12916 , year=

    Reliable thinking with images , author=. arXiv preprint arXiv:2602.12916 , year=

  35. [35]

    HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

    HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling , author=. arXiv preprint arXiv:2510.00054 , year=

  36. [36]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

  37. [37]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

  38. [38]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

  39. [39]

    Kullback-leibler divergence , author=. Tech. Rep. , year=

  40. [40]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Lisa: Reasoning segmentation via large language model , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  41. [41]

    arXiv preprint arXiv:2506.01663 , year=

    Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement , author=. arXiv preprint arXiv:2506.01663 , year=

  42. [42]

    arXiv preprint arXiv:2510.18876 , year=

    Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs , author=. arXiv preprint arXiv:2510.18876 , year=

  43. [43]

    Advances in Neural Information Processing Systems , volume=

    Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation , author=. Advances in Neural Information Processing Systems , volume=

  44. [44]

    Spatialladder: Progressive training for spatial reasoning in vision-language models,

    Spatialladder: Progressive training for spatial reasoning in vision-language models , author=. arXiv preprint arXiv:2510.08531 , year=

  45. [45]

    arXiv preprint arXiv:2505.21500 , year=

    Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models , author=. arXiv preprint arXiv:2505.21500 , year=

  46. [46]

    International Conference on Learning Representations , volume=

    Mllms know where to look: Training-free perception of small visual details with multimodal llms , author=. International Conference on Learning Representations , volume=

  47. [47]

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    AdaTooler-V: Adaptive Tool-Use for Images and Videos , author=. arXiv preprint arXiv:2512.16918 , year=

  48. [48]

    arXiv preprint arXiv:2507.07998 , year=

    Pyvision: Agentic vision with dynamic tooling , author=. arXiv preprint arXiv:2507.07998 , year=

  49. [49]

    GroundAct: Can LLM Agents Ground Actions in Environmental States?

    Omniear: Benchmarking agent reasoning in embodied tasks , author=. arXiv preprint arXiv:2508.05614 , year=

  50. [50]

    Proceedings of the Twentieth European Conference on Computer Systems , pages=

    Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=

  51. [51]

    Proceedings of the 29th symposium on operating systems principles , pages=

    Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th symposium on operating systems principles , pages=

  52. [52]

    Advances in Neural Information Processing Systems , volume=

    Dapo: An open-source llm reinforcement learning system at scale , author=. Advances in Neural Information Processing Systems , volume=

  53. [53]

    arXiv preprint arXiv:2512.21625 , year=

    Rethinking sample polarity in reinforcement learning with verifiable rewards , author=. arXiv preprint arXiv:2512.21625 , year=

  54. [54]

    GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

    Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization , author=. arXiv preprint arXiv:2601.05242 , year=

  55. [55]

    Proximal Policy Optimization Algorithms

    Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

  56. [56]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

  57. [57]

    Advances in Neural Information Processing Systems , volume=

    Learning to reason under off-policy guidance , author=. Advances in Neural Information Processing Systems , volume=

  58. [58]

    arXiv preprint arXiv:2602.20739 , year=

    PyVision-RL: Forging Open Agentic Vision Models via RL , author=. arXiv preprint arXiv:2602.20739 , year=

  59. [59]

    arXiv preprint arXiv:2509.25916 , year=

    Vlm-fo1: Bridging the gap between high-level reasoning and fine-grained perception in vlms , author=. arXiv preprint arXiv:2509.25916 , year=

  60. [60]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement , author=. arXiv preprint arXiv:2503.06520 , year=

  61. [61]

    Advances in Neural Information Processing Systems , volume=

    Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  62. [62]

    arXiv preprint arXiv:2509.13031 , year=

    Perception before reasoning: Two-stage reinforcement learning for visual reasoning in vision-language models , author=. arXiv preprint arXiv:2509.13031 , year=

  63. [63]

    Group Sequence Policy Optimization

    Group sequence policy optimization , author=. arXiv preprint arXiv:2507.18071 , year=

  64. [64]

    arXiv preprint arXiv:2506.19767 , year=

    Srft: A single-stage method with supervised and reinforcement fine-tuning for reasoning , author=. arXiv preprint arXiv:2506.19767 , year=