{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:IPGNRXOGQTIZKJIQVGLRWUR6R4","short_pith_number":"pith:IPGNRXOG","schema_version":"1.0","canonical_sha256":"43ccd8ddc684d1952510a9971b523e8f08a353fb05094a66dfef2a526f46bfb7","source":{"kind":"arxiv","id":"2501.07301","version":2},"attestation_state":"computed","paper":{"title":"The Lessons of Developing Process Reward Models in Mathematical Reasoning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Beichen Zhang, Bowen Yu, Chujie Zheng, Dayiheng Liu, Jingren Zhou, Junyang Lin, Runji Lin, Yangzhen Wu, Zhenru Zhang","submitted_at":"2025-01-13T13:10:16Z","abstract_excerpt":"Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotat"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2501.07301","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2025-01-13T13:10:16Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"d0d836b11be729d0489a5905659f20cb2d80a8e72807e76529642c710c26f9f0","abstract_canon_sha256":"c347e167b1e6e525c0aa0967effdee336a462359514005faddfc12323f8ee860"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.713513Z","signature_b64":"95pg/X1VZdvDSgnJfGKa0HTxxWOdQLVm3tH4U5eT063aK2/1PpinXixMaa1R0pW/pDirqeYy0M4OxwD4ZjqiDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"43ccd8ddc684d1952510a9971b523e8f08a353fb05094a66dfef2a526f46bfb7","last_reissued_at":"2026-05-17T23:38:47.712978Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.712978Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"The Lessons of Developing Process Reward Models in Mathematical Reasoning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Beichen Zhang, Bowen Yu, Chujie Zheng, Dayiheng Liu, Jingren Zhou, Junyang Lin, Runji Lin, Yangzhen Wu, Zhenru Zhang","submitted_at":"2025-01-13T13:10:16Z","abstract_excerpt":"Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotat"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed biases in Best-of-N evaluation and the superiority of consensus filtering generalize beyond the specific models, datasets, and tasks tested in the experiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Monte Carlo data synthesis for PRMs underperforms LLM-judge and human methods, Best-of-N evaluations suffer from process-outcome misalignment and score inflation, and consensus filtering yields better PRMs with higher data efficiency.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5e3d762866e74fcff1b25558490584e9191ec7251ee4d27b9b689dad4ac13800"},"source":{"id":"2501.07301","kind":"arxiv","version":2},"verdict":{"id":"7f6ded8a-59e5-4661-ae61-5274d6070255","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T13:39:01.668243Z","strongest_claim":"we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives","one_line_summary":"Monte Carlo data synthesis for PRMs underperforms LLM-judge and human methods, Best-of-N evaluations suffer from process-outcome misalignment and score inflation, and consensus filtering yields better PRMs with higher data efficiency.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed biases in Best-of-N evaluation and the superiority of consensus filtering generalize beyond the specific models, datasets, and tasks tested in the experiments.","pith_extraction_headline":"Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations."},"references":{"count":19,"sample":[{"doi":"","year":null,"title":"Alphamath almost zero: Process supervision without process","work_id":"142e2ffe-057a-4f74-9691-31fc3b21fb03","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","ref_index":2,"cited_arxiv_id":"2407.21783","is_internal_anchor":true},{"doi":"","year":null,"title":"Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback","work_id":"b8a7626e-ea9b-43ef-b087-69fd533b7413","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","ref_index":4,"cited_arxiv_id":"2103.03874","is_internal_anchor":true},{"doi":"","year":2022,"title":"Ra- masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra","work_id":"5a741743-7913-4194-8cac-fb0de071f2a8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":19,"snapshot_sha256":"7a3e3ea3a044ea349c24cc0415502cfb5fda21ee597249d33c7947001fbbbb5b","internal_anchors":8},"formal_canon":{"evidence_count":1,"snapshot_sha256":"fdec6b9749461edd2b56f0179cd7ad15b0990a3d0a6f934862987a4e65f2bcc3"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2501.07301","created_at":"2026-05-17T23:38:47.713053+00:00"},{"alias_kind":"arxiv_version","alias_value":"2501.07301v2","created_at":"2026-05-17T23:38:47.713053+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2501.07301","created_at":"2026-05-17T23:38:47.713053+00:00"},{"alias_kind":"pith_short_12","alias_value":"IPGNRXOGQTIZ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"IPGNRXOGQTIZKJIQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"IPGNRXOG","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2502.13957","citing_title":"Supervising the search process produces reliable and generalizable information-seeking agents","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2504.02181","citing_title":"A Survey of Scaling in Large Language Model Reasoning","ref_index":254,"is_internal_anchor":true},{"citing_arxiv_id":"2504.09775","citing_title":"MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2509.03403","citing_title":"Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15951","citing_title":"From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding","ref_index":99,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17497","citing_title":"Self-Supervised On-Policy Distillation for Reasoning Language Models","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2508.15202","citing_title":"Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2510.04142","citing_title":"Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2601.02535","citing_title":"ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2505.18719","citing_title":"VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning","ref_index":88,"is_internal_anchor":true},{"citing_arxiv_id":"2603.25412","citing_title":"Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12652","citing_title":"Multi-Rollout On-Policy Distillation via Peer Successes and Failures","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12384","citing_title":"Scalable Token-Level Hallucination Detection in Large Language Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2502.17419","citing_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","ref_index":174,"is_internal_anchor":true},{"citing_arxiv_id":"2601.18734","citing_title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23318","citing_title":"Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23333","citing_title":"Process Supervision of Confidence Margin for Calibrated LLM Reasoning","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01203","citing_title":"GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04064","citing_title":"Improving Medical VQA through Trajectory-Aware Process Supervision","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2504.10479","citing_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","ref_index":153,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15148","citing_title":"IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14641","citing_title":"Learning to Draw ASCII Improves Spatial Reasoning in Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18327","citing_title":"PARM: Pipeline-Adapted Reward Model","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15705","citing_title":"Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05226","citing_title":"Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning","ref_index":27,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4","json":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4.json","graph_json":"https://pith.science/api/pith-number/IPGNRXOGQTIZKJIQVGLRWUR6R4/graph.json","events_json":"https://pith.science/api/pith-number/IPGNRXOGQTIZKJIQVGLRWUR6R4/events.json","paper":"https://pith.science/paper/IPGNRXOG"},"agent_actions":{"view_html":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4","download_json":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4.json","view_paper":"https://pith.science/paper/IPGNRXOG","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2501.07301&json=true","fetch_graph":"https://pith.science/api/pith-number/IPGNRXOGQTIZKJIQVGLRWUR6R4/graph.json","fetch_events":"https://pith.science/api/pith-number/IPGNRXOGQTIZKJIQVGLRWUR6R4/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4/action/timestamp_anchor","attest_storage":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4/action/storage_attestation","attest_author":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4/action/author_attestation","sign_citation":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4/action/citation_signature","submit_replication":"https://pith.science/pith/IPGNRXOGQTIZKJIQVGLRWUR6R4/action/replication_record"}},"created_at":"2026-05-17T23:38:47.713053+00:00","updated_at":"2026-05-17T23:38:47.713053+00:00"}