{"paper":{"title":"The Lessons of Developing Process Reward Models in Mathematical Reasoning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Beichen Zhang, Bowen Yu, Chujie Zheng, Dayiheng Liu, Jingren Zhou, Junyang Lin, Runji Lin, Yangzhen Wu, Zhenru Zhang","submitted_at":"2025-01-13T13:10:16Z","abstract_excerpt":"Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotat"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed biases in Best-of-N evaluation and the superiority of consensus filtering generalize beyond the specific models, datasets, and tasks tested in the experiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Monte Carlo data synthesis for PRMs underperforms LLM-judge and human methods, Best-of-N evaluations suffer from process-outcome misalignment and score inflation, and consensus filtering yields better PRMs with higher data efficiency.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5e3d762866e74fcff1b25558490584e9191ec7251ee4d27b9b689dad4ac13800"},"source":{"id":"2501.07301","kind":"arxiv","version":2},"verdict":{"id":"7f6ded8a-59e5-4661-ae61-5274d6070255","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T13:39:01.668243Z","strongest_claim":"we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives","one_line_summary":"Monte Carlo data synthesis for PRMs underperforms LLM-judge and human methods, Best-of-N evaluations suffer from process-outcome misalignment and score inflation, and consensus filtering yields better PRMs with higher data efficiency.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed biases in Best-of-N evaluation and the superiority of consensus filtering generalize beyond the specific models, datasets, and tasks tested in the experiments.","pith_extraction_headline":"Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations."},"references":{"count":19,"sample":[{"doi":"","year":null,"title":"Alphamath almost zero: Process supervision without process","work_id":"142e2ffe-057a-4f74-9691-31fc3b21fb03","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","ref_index":2,"cited_arxiv_id":"2407.21783","is_internal_anchor":true},{"doi":"","year":null,"title":"Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback","work_id":"b8a7626e-ea9b-43ef-b087-69fd533b7413","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","ref_index":4,"cited_arxiv_id":"2103.03874","is_internal_anchor":true},{"doi":"","year":2022,"title":"Ra- masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra","work_id":"5a741743-7913-4194-8cac-fb0de071f2a8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":19,"snapshot_sha256":"7a3e3ea3a044ea349c24cc0415502cfb5fda21ee597249d33c7947001fbbbb5b","internal_anchors":8},"formal_canon":{"evidence_count":1,"snapshot_sha256":"fdec6b9749461edd2b56f0179cd7ad15b0990a3d0a6f934862987a4e65f2bcc3"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}