The lessons of developing process reward models in mathematical reasoning

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin · 2025

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.

Process Rewards with Learned Reliability

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

citing papers explorer

Showing 3 of 3 citing papers.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation cs.LG · 2026-05-12 · conditional · none · ref 34
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization cs.AI · 2026-05-18 · unverdicted · none · ref 30
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
Process Rewards with Learned Reliability cs.CL · 2026-05-15 · unverdicted · none · ref 72
BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

The lessons of developing process reward models in mathematical reasoning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer