{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:PDCTJBTC6GDYS74PKQ3BCJJPGR","short_pith_number":"pith:PDCTJBTC","schema_version":"1.0","canonical_sha256":"78c5348662f187897f8f543611252f34475cba6704c6af5bff58952f261625f2","source":{"kind":"arxiv","id":"2312.08935","version":3},"attestation_state":"computed","paper":{"title":"Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Math-Shepherd trains reward models on auto-generated step labels to verify and reinforce LLM math solutions without human annotations.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.AI","authors_text":"Damai Dai, Deli Chen, Lei Li, Peiyi Wang, R.X. Xu, Yifei Li, Y.Wu, Zhifang Sui, Zhihong Shao","submitted_at":"2023-12-14T13:41:54Z","abstract_excerpt":"In this paper, we present an innovative process-oriented math process reward model called \\textbf{Math-Shepherd}, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) \\textit{Verification}: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) \\textit{Reinforcement Learning}: Math-"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2312.08935","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.AI","submitted_at":"2023-12-14T13:41:54Z","cross_cats_sorted":["cs.CL","cs.LG"],"title_canon_sha256":"58c714679bd1e6ffd8b0af13ba869ae97a1daf04659e61eab2fc800ac1e80bd8","abstract_canon_sha256":"91adbad2920422e205f37202aebe55cbf7b3d38603738d2a71ec7336af409f1b"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:21.443166Z","signature_b64":"NenUiQfXGeuy1TuEPAfCUplsnfNWrWFE2wwte8MoNneg6QVU86bTPhiJBpNbPc8ZvDUyDtWE6CfAVYriQ/5EDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"78c5348662f187897f8f543611252f34475cba6704c6af5bff58952f261625f2","last_reissued_at":"2026-05-17T23:39:21.442498Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:21.442498Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Math-Shepherd trains reward models on auto-generated step labels to verify and reinforce LLM math solutions without human annotations.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.AI","authors_text":"Damai Dai, Deli Chen, Lei Li, Peiyi Wang, R.X. Xu, Yifei Li, Y.Wu, Zhifang Sui, Zhihong Shao","submitted_at":"2023-12-14T13:41:54Z","abstract_excerpt":"In this paper, we present an innovative process-oriented math process reward model called \\textbf{Math-Shepherd}, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) \\textit{Verification}: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) \\textit{Reinforcement Learning}: Math-"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9%→84.1% on GSM8K and 28.6%→33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That automatically constructed process-wise supervision data accurately labels correct versus incorrect reasoning steps without systematic bias or noise from the generation process itself.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Math-Shepherd trains reward models on auto-generated step labels to verify and reinforce LLM math solutions without human annotations.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"31073cbe9ee9b6bb1cad67334a070dc6e908f2f330ea328d8692dcd2220d7195"},"source":{"id":"2312.08935","kind":"arxiv","version":3},"verdict":{"id":"d75c5099-d22d-4fbe-aebf-456ab69fe5f1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T22:29:38.954816Z","strongest_claim":"the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9%→84.1% on GSM8K and 28.6%→33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd.","one_line_summary":"Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That automatically constructed process-wise supervision data accurately labels correct versus incorrect reasoning steps without systematic bias or noise from the generation process itself.","pith_extraction_headline":"Math-Shepherd trains reward models on auto-generated step labels to verify and reinforce LLM math solutions without human annotations."},"references":{"count":60,"sample":[{"doi":"10.18653/v1/2022.emnlp-main.225","year":2022,"title":"Red teaming language models with language models","work_id":"664322b7-6ac6-46c5-b1f2-193a778945d2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"International Conference on Machine Learning , pages=","work_id":"1a650b5b-a768-4080-8629-3f3f3fe0d908","ref_index":10,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Proceedings of the 29th Symposium on Operating Systems Principles , pages=","work_id":"942205f6-1365-4509-8f46-700be8023817","ref_index":11,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Chi and Quoc V","work_id":"e5146dbc-54b0-40cd-a84c-3d72af59c83f","ref_index":13,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Xuezhi Wang and Jason Wei and Dale Schuurmans and Quoc V. Le and Ed H. Chi and Sharan Narang and Aakanksha Chowdhery and Denny Zhou , title =. The Eleventh International Conference on Learning Represe","work_id":"871bab98-59d0-4eff-8c2d-56c7055fbe92","ref_index":14,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":60,"snapshot_sha256":"efd9458cd4f2b4aedc77964b90d0c75b1942d7b59014aec8b5e9018a7f2bae28","internal_anchors":19},"formal_canon":{"evidence_count":2,"snapshot_sha256":"99ac36ec696a4495b7b2913802107dc7d447e534379a98c764abdc37ee66f038"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2312.08935","created_at":"2026-05-17T23:39:21.442604+00:00"},{"alias_kind":"arxiv_version","alias_value":"2312.08935v3","created_at":"2026-05-17T23:39:21.442604+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2312.08935","created_at":"2026-05-17T23:39:21.442604+00:00"},{"alias_kind":"pith_short_12","alias_value":"PDCTJBTC6GDY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"PDCTJBTC6GDYS74P","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"PDCTJBTC","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":39,"internal_anchor_count":39,"sample":[{"citing_arxiv_id":"2506.03530","citing_title":"How Far Are We from Generating Missing Modalities with Foundation Models?","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2402.03300","citing_title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2411.15594","citing_title":"A Survey on LLM-as-a-Judge","ref_index":160,"is_internal_anchor":true},{"citing_arxiv_id":"2501.19201","citing_title":"Efficient Reasoning with Hidden Thinking","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2504.02181","citing_title":"A Survey of Scaling in Large Language Model Reasoning","ref_index":206,"is_internal_anchor":true},{"citing_arxiv_id":"2509.03403","citing_title":"Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16604","citing_title":"R2V Agent: Teaching SLMs When to Ask for Help","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17291","citing_title":"Step-wise Rubric Rewards for LLM Reasoning","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2501.09732","citing_title":"Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17877","citing_title":"PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19228","citing_title":"Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20061","citing_title":"Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2507.15351","citing_title":"One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2508.15202","citing_title":"Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02547","citing_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","ref_index":185,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07536","citing_title":"LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2503.06520","citing_title":"Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2501.09686","citing_title":"Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models","ref_index":151,"is_internal_anchor":true},{"citing_arxiv_id":"2410.07985","citing_title":"Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21046","citing_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","ref_index":232,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12545","citing_title":"CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2406.06592","citing_title":"Improve Mathematical Reasoning in Language Models by Automated Process Supervision","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2407.21787","citing_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03356","citing_title":"POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23333","citing_title":"Process Supervision of Confidence Margin for Calibrated LLM Reasoning","ref_index":73,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR","json":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR.json","graph_json":"https://pith.science/api/pith-number/PDCTJBTC6GDYS74PKQ3BCJJPGR/graph.json","events_json":"https://pith.science/api/pith-number/PDCTJBTC6GDYS74PKQ3BCJJPGR/events.json","paper":"https://pith.science/paper/PDCTJBTC"},"agent_actions":{"view_html":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR","download_json":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR.json","view_paper":"https://pith.science/paper/PDCTJBTC","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2312.08935&json=true","fetch_graph":"https://pith.science/api/pith-number/PDCTJBTC6GDYS74PKQ3BCJJPGR/graph.json","fetch_events":"https://pith.science/api/pith-number/PDCTJBTC6GDYS74PKQ3BCJJPGR/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR/action/timestamp_anchor","attest_storage":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR/action/storage_attestation","attest_author":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR/action/author_attestation","sign_citation":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR/action/citation_signature","submit_replication":"https://pith.science/pith/PDCTJBTC6GDYS74PKQ3BCJJPGR/action/replication_record"}},"created_at":"2026-05-17T23:39:21.442604+00:00","updated_at":"2026-05-17T23:39:21.442604+00:00"}