A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Congmin Zheng; Jiachen Zhu; Jianghao Lin; Kangning Zhang; Mengyue Yang; Rong Shan; Weinan Zhang; Yong Yu; Yuxiang Chen; Zeyu Zheng

arxiv: 2510.08049 · v3 · submitted 2025-10-09 · 💻 cs.CL · cs.AI

Congmin Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, Weinan Zhang

Pith reviewed 2026-05-18 09:08 UTC · model grok-4.3

keywords: Process Reward Models · Large Language Models · Reasoning Alignment · Reinforcement Learning · Test-time Scaling · Process Supervision · Outcome Reward Models

The pith

Process Reward Models evaluate and guide LLM reasoning at the step level rather than only judging final answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey systematically reviews Process Reward Models that assess reasoning steps or trajectories in large language models. It traces the full pipeline from generating process supervision data through model construction to applications in test-time scaling and reinforcement learning. The review covers uses across mathematics, code generation, text reasoning, multimodal tasks, robotics, and agents while cataloging new benchmarks. A sympathetic reader would care because standard alignment relies on outcome signals that miss errors in intermediate reasoning, and process-level feedback could produce more reliable and controllable model behavior.

Core claim

Process Reward Models address the limitation of outcome-only alignment by providing supervision at the level of individual reasoning steps or full trajectories, and this survey synthesizes methods for creating process data, training such models, and deploying them for scaling and reinforcement learning across multiple domains to clarify design choices and surface open problems in fine-grained reasoning alignment.
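The difference between outcome-only and step-level signals in a reinforcement learning setting can be sketched with a reward-to-go computation. This is a minimal illustration, not a method taken from the survey: the function name `discounted_returns`, the discount `gamma`, and the example reward values are all assumptions made for this sketch.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Reward-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns: List[float] = []
    running = 0.0
    # Walk backwards so each step accumulates the rewards that follow it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Outcome-only supervision (assumed values): a single reward at the final step.
sparse_rewards = [0.0, 0.0, 1.0]
# Process supervision (assumed values): each step carries its own reward,
# here with the second step judged faulty.
dense_rewards = [1.0, -1.0, 1.0]

sparse_returns = discounted_returns(sparse_rewards)
dense_returns = discounted_returns(dense_rewards)
print(sparse_returns)
print(dense_returns)
```

With the outcome-only rewards every step receives the same return, so the learning signal cannot tell the steps apart. With the step-level rewards the returns differ per step, which is the kind of finer credit assignment that process supervision is meant to provide.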

What carries the argument

Process Reward Models (PRMs) that assign rewards or supervision signals to intermediate steps and trajectories rather than only to final outputs.
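A toy contrast between the two kinds of signal may make the distinction concrete. The sketch below is an assumption-laden illustration: `Trajectory`, `outcome_reward`, `toy_step_score`, and `process_rewards` are hypothetical names, and the rule-based arithmetic checker only stands in for a learned PRM, which in practice would be a trained model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    steps: List[str]    # intermediate reasoning steps, each of the form "a op b = c"
    final_answer: str   # the answer the trajectory ends on


def outcome_reward(traj: Trajectory, gold: str) -> float:
    """ORM-style signal: one scalar that depends only on the final answer."""
    return 1.0 if traj.final_answer.strip() == gold.strip() else 0.0


def toy_step_score(step: str) -> float:
    """Hypothetical stand-in for a learned PRM: verifies 'a op b = c' arithmetic."""
    lhs, rhs = step.split("=")
    a_text, op, b_text = lhs.split()
    a, b, claimed = int(a_text), int(b_text), int(rhs)
    actual = {"+": a + b, "-": a - b, "*": a * b}[op]
    return 1.0 if actual == claimed else 0.0


def process_rewards(traj: Trajectory) -> List[float]:
    """PRM-style signal: one score per intermediate step."""
    return [toy_step_score(step) for step in traj.steps]


# A trajectory that reaches the right answer through a faulty first step.
traj = Trajectory(steps=["2 + 3 = 6", "6 - 1 = 5"], final_answer="5")
print(outcome_reward(traj, gold="5"))  # 1.0
print(process_rewards(traj))           # [0.0, 1.0]
```

The outcome signal accepts the trajectory, while the step-level scores flag the first step as wrong. That gap is the one PRMs are described as addressing.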

Load-bearing premise

The survey assumes that the existing body of work on PRMs is sufficiently mature and representative to support a meaningful synthesis that clarifies design spaces and reveals open challenges.

What would settle it

A controlled study across several reasoning domains that finds no reliable accuracy gain from adding process supervision once total compute and outcome reward strength are matched would undermine the practical value of shifting focus to PRMs.

Original abstract

Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
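One way to picture the use of PRMs for test-time scaling is best-of-N selection: sample several candidate solutions, score each step, aggregate the step scores into one number per candidate, and keep the top candidate. The sketch below assumes the step scores have already been produced by some PRM. The candidate names, the score values, and the two aggregation rules (minimum and product) are illustrative assumptions, not choices attributed to the survey.

```python
import math
from typing import Callable, Dict, List, Sequence


def aggregate_min(scores: Sequence[float]) -> float:
    """A trajectory is only as good as its weakest step."""
    return min(scores)


def aggregate_prod(scores: Sequence[float]) -> float:
    """Treat step scores as independent correctness probabilities."""
    return math.prod(scores)


def best_of_n(
    candidates: Dict[str, List[float]],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> str:
    """Return the name of the candidate with the highest aggregated score."""
    return max(candidates, key=lambda name: aggregate(candidates[name]))


# Assumed per-step PRM scores for three sampled solutions.
candidates = {
    "A": [0.9, 0.2, 0.95],  # strong overall, one weak step
    "B": [0.7, 0.7, 0.7],   # uniformly moderate
    "C": [0.8, 0.6, 0.9],   # mixed
}

print(best_of_n(candidates, aggregate_min))   # B
print(best_of_n(candidates, aggregate_prod))  # C
```

The two aggregation rules pick different winners on the same scores, which illustrates why the way step-level signals are combined is itself a design choice.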

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competent survey that organizes the PRM literature into a clear pipeline, but it adds no new methods or data, and its value hinges on depth of coverage.

The main thing to know is that this paper is a literature survey pulling together work on process reward models for LLMs. It walks through data generation, model construction, test-time use, and reinforcement learning applications, then maps those ideas onto math, code, multimodal, robotics, and agent settings. That structure is the real contribution here. It gives readers a single place to see how the field has moved from outcome-only signals to step-level supervision, and it flags some open questions around benchmarks and robustness. That kind of map can save time for people entering the area or trying to place their own experiments. The authors do a reasonable job laying out the design choices at each stage without overclaiming novelty. The writing stays focused on the technical loop rather than hype.

On the downside, the survey rests on the assumption that the existing papers are representative enough to reveal clear design spaces and gaps. Without seeing the full reference list and selection criteria, it is hard to judge whether important early or non-English work got left out. Surveys can also age quickly in a fast-moving subfield, so some of the stated challenges may already be shifting. The paper does not appear to have internal contradictions or unsupported derivations, since it is not deriving new results.

It is the kind of reference that a lab working on reasoning alignment or test-time compute would keep on hand. I would bring it to a reading group for the discussion of benchmarks and applications, and I would cite it when writing an introduction or related-work section on fine-grained supervision. It is worth sending to peer review; a careful referee can check coverage and suggest missing threads without needing to rewrite the core organization.

Referee Report

0 major / 4 minor

Summary. The paper is a literature survey on Process Reward Models (PRMs) for large language models. It claims to provide a systematic overview covering the full pipeline: methods for generating process-level supervision data, techniques for building PRMs, their deployment for test-time scaling and reinforcement learning, applications across mathematics, code generation, text reasoning, multimodal tasks, robotics and agents, plus a review of emerging benchmarks. The stated goal is to map design spaces, surface open challenges, and guide future work on fine-grained reasoning alignment beyond outcome-only reward models.

Significance. A well-executed survey in this area would be useful for organizing the rapidly growing body of work on process supervision. By tracing the loop from data generation through model construction to downstream uses in scaling and RL, and by cataloguing applications and benchmarks, the paper could help researchers identify common design patterns and underexplored directions in moving from outcome to process signals.

minor comments (4)

[Introduction] The abstract and introduction assert a 'systematic' coverage of data generation, model building, and applications, but the manuscript should explicitly state the literature search strategy, inclusion criteria, and cut-off date to allow readers to assess completeness and potential selection bias.
[Benchmarks] In the section reviewing benchmarks, several emerging evaluation suites are listed; the paper would benefit from a comparative table that highlights differences in granularity (step-level vs. trajectory-level), domain coverage, and whether they provide human or automatic process annotations.
[Applications] When discussing applications in robotics and agents, the survey should clarify how PRM-style process signals are adapted to continuous or partially observable environments, as the transition from discrete reasoning steps is not always straightforward.
A small number of citations appear to be preprints or workshop papers; the authors should verify that all references have stable archival versions or DOIs where possible.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the survey and for recommending minor revision. We appreciate the recognition that a well-executed overview of process supervision can help organize the rapidly growing literature and highlight underexplored directions.

Circularity Check

0 steps flagged

No significant circularity in this survey paper

This paper is a literature survey that synthesizes existing research on Process Reward Models without any new derivations, equations, fitted parameters, predictions, or first-principles claims. It reviews data generation methods, model construction, test-time scaling, RL applications, and benchmarks across domains by referencing prior work. No load-bearing steps reduce to self-defined inputs or circular self-citations; the central contribution is a mapping of the design space and open challenges, which is self-contained and independent of any internal fitting or renaming of results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, this work does not introduce or rely on new free parameters, axioms, or invented entities; it reviews existing research on PRMs.

pith-pipeline@v0.9.0 · 5675 in / 1065 out tokens · 41302 ms · 2026-05-18T09:08:06.716007+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "PRMs address this gap by evaluating and guiding reasoning at the step or trajectory level."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improving Vision-language Models with Perception-centric Process Reward Models
cs.CV · 2026-04 · unverdicted · novelty 7.0
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
cs.CL · 2026-04 · unverdicted · novelty 7.0
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
cs.AI · 2025-10 · unverdicted · novelty 7.0
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
cs.CV · 2026-05 · unverdicted · novelty 6.0
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering
cs.AI · 2026-05 · unverdicted · novelty 5.0
SCPRM adds prefix conditioning and schema distance to process reward models so that Monte Carlo Tree Search can explore knowledge-graph reasoning paths with both cumulative and future guidance, yielding a 1.18% averag...

LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models
cs.CL · 2026-04 · unverdicted · novelty 5.0
LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG · 2026-04 · unverdicted · novelty 5.0
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
cs.SE · 2026-04 · accept · novelty 5.0
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
cs.CL · 2026-04 · accept · novelty 5.0
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.