Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
hub
A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
10 Pith papers cite this work. Polarity classification is still indexing.
abstract
Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
SCPRM adds prefix conditioning and schema distance to process reward models so that Monte Carlo Tree Search can explore knowledge-graph reasoning paths with both cumulative and future guidance, yielding a 1.18% average Hits@k gain on medical, legal, and CWQ tasks.
LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
citing papers explorer
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
-
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
-
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
-
SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering
SCPRM adds prefix conditioning and schema distance to process reward models so that Monte Carlo Tree Search can explore knowledge-graph reasoning paths with both cumulative and future guidance, yielding a 1.18% average Hits@k gain on medical, legal, and CWQ tasks.
-
LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models
LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.