Monte Carlo data synthesis for PRMs underperforms LLM-judge and human methods, Best-of-N evaluations suffer from process-outcome misalignment and score inflation, and consensus filtering yields better PRMs with higher data efficiency.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
The Lessons of Developing Process Reward Models in Mathematical Reasoning