GRPO is Secretly a Process Reward Model

Alexander Koller; Michael Sullivan

arxiv: 2509.21154 · v4 · pith:IVEMD2RMnew · submitted 2025-09-25 · 💻 cs.LG · cs.AI

keywords: grpo, reward, algorithm, process, equipped, lambda, llms, model

Abstract

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs tuned with $\lambda$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
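For readers unfamiliar with the setup the abstract starts from, the following is a minimal sketch of how standard GRPO turns outcome rewards into advantages: each sampled completion in a group gets one scalar reward, the rewards are normalized within the group, and every token of a completion shares that completion's advantage. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; this is not the paper's code and does not implement $\lambda$-GRPO or the PRM equivalence.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one outcome reward per completion within its group.

    `eps` (hypothetical default) guards against division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_advantages(rewards, lengths):
    """Broadcast each completion's advantage to all of its tokens.

    With an ORM there is a single reward per trajectory, so every token
    in a completion carries the same advantage value.
    """
    advantages = group_relative_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, lengths)]


if __name__ == "__main__":
    # Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
    rewards = [1.0, 0.0, 0.0, 1.0]
    lengths = [3, 2, 4, 1]  # token counts per completion (made-up values)
    for row in per_token_advantages(rewards, lengths):
        print(row)
```

On its face this assigns credit only at the trajectory level, which is the apparent contrast with PRMs that the abstract says the paper resolves.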

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity
cs.LG 2026-06 unverdicted novelty 7.0

GRPO, Dr. GRPO, and DAPO are three settings of one dial on the group standard deviation of binary rewards, unified by the group-standard-deviation identity where disagreement equals update magnitude.

Rethinking Groups in Critic-Free RLVR
cs.LG 2026-06 unverdicted novelty 6.0

Negative token filtering enables single-rollout critic-free RL training by avoiding false penalties on negative samples, matching group-based methods on reasoning tasks and exceeding them on agentic tasks.