{"work":{"id":"6d05b790-04c5-4fd2-91b2-ba1dfdd5770f","openalex_id":null,"doi":null,"arxiv_id":"2305.20050","raw_key":null,"title":"Let's Verify Step by Step","authors":null,"authors_text":"Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee","year":2023,"venue":"cs.LG","abstract":"In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.","external_url":"https://arxiv.org/abs/2305.20050","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:30:25.597541+00:00","pith_arxiv_id":"2305.20050","created_at":"2026-05-09T02:04:37.370179+00:00","updated_at":"2026-05-25T06:30:25.597541+00:00","title_quality_ok":false,"display_title":"Let's Verify Step by Step","render_title":"Let's Verify Step by Step"},"hub":{"state":{"work_id":"6d05b790-04c5-4fd2-91b2-ba1dfdd5770f","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":178,"external_cited_by_count":null,"distinct_field_count":16,"first_pith_cited_at":"2023-08-03T15:34:01+00:00","last_pith_cited_at":"2026-05-22T10:55:15+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-04T03:06:55.111407+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":25},{"context_role":"dataset","n":4},{"context_role":"method","n":2}],"polarity_counts":[{"context_polarity":"background","n":25},{"context_polarity":"use_dataset","n":4},{"context_polarity":"use_method","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Let's Verify Step by Step","claims":[{"claim_text":"In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, bu","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Multi-turnOptimization ACTIVE-CRITIC [256], AUTOCALIBRATE [146], Auto-Arena [153, 286], LMExam [10], KIEval [273] Tuning-based (§4.1.2) Score-based TuningChen et al. [28], AttrScore [276], PHUDGE [47], ECT [229], SELF-J [266], SorryBench [249], TIGERScore [99],FENCE [252], ARES [188] Preference-basedLearning Meta-Rewarding [245], Con-J [270], JudgeLM [301], INSTRUCTSCORE [258], AUTO-J [130], Shepherd [232],X-EVAL [142], Themis [88], CritiqueLLM [106], FedEval-LLM [84], PandaLM [236], Self-Taught","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Reinforcement learning with verifiable rewards (RLVR) sidesteps learned reward models entirely by using deterministic verifiers, as demonstrated by approaches like SHARP [29] and SWiRL [26], which synthesize high-quality reasoning trajectories through verifiable rewards and step-wise optimization. Orthogonal to the reward source is the choice of RL training algorithm. DeepSeekMath [46] introduced Group Relative Policy Optimization (GRPO), which eliminates the need for a critic model by leveragin","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"the resulting driving term b tends to concentrate on parameter directions that most strongly affect those critical token predictions. This is consistent with the low-rank structure of ∆W observed in Section 3. Module-Wise Suppression (Functional Redundancy Avoidance).Decompose the parameters intoMmodules (e.g., embedding, attention, MLP layers). Write: ∆θ= (∆θ 1,∆θ 2, . . . ,∆θM), J c = [Jc,1, Jc,2, . . . , Jc,M],(52) whereJ c,m =∂z θ(c)/∂θm|θ0. Then the driving term for modulemis: bm =E c[J ⊤ c","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Let's Verify Step by Step because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (2 contexts).","role_counts":[{"n":2,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-15T03:17:39.999734+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"ba8429f2-0064-4e50-b6e5-ff02dd881d98","orcid":null,"display_name":"Hunter Lightman"},{"id":"edf3b705-ff6d-4713-9b26-27729234c00d","orcid":null,"display_name":"Vineet Kosaraju"},{"id":"ffac279e-a012-45b9-875d-270eb3f98c86","orcid":null,"display_name":"Yura Burda"},{"id":"d5c86cf6-d79b-4cf5-b628-e73978a352ec","orcid":null,"display_name":"Harri Edwards"},{"id":"d489d16b-253f-46a4-97a0-a0ded8d015e8","orcid":null,"display_name":"Bowen Baker"},{"id":"c6cfb8c7-bdfc-4d22-88ad-4c1a22a4b96d","orcid":null,"display_name":"Teddy Lee"}]},"error":null,"updated_at":"2026-05-15T03:17:39.994020+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:39:47.124455+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":37},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":36},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":25},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":25},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":18},{"title":"Solving math word problems with process- and outcome-based feedback","work_id":"94492239-b1d5-435e-bea5-7f51992d0614","shared_citers":16},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":16},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":13},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":11},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":10},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":9},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":9},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":9},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":9},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":9},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":9},{"title":"arXiv preprint arXiv:2312.08935 , year=","work_id":"fb547990-d48e-4ba3-a047-af1e506e8290","shared_citers":8},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":8},{"title":"Nature645(8081), 633–638 (2025) https://doi.org/10.1038/s41586-025-09422-z","work_id":"9835b482-5032-4135-93dd-82a066677569","shared_citers":8},{"title":"s1: Simple test-time scaling","work_id":"806265b1-8f22-48dd-b8ad-a99823b18fa4","shared_citers":8},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":8}],"time_series":[{"n":1,"year":2023},{"n":2,"year":2024},{"n":5,"year":2025},{"n":80,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:39:51.041130+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:39:25.309297+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Let's Verify Step by Step","claims":[{"claim_text":"In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, bu","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Multi-turnOptimization ACTIVE-CRITIC [256], AUTOCALIBRATE [146], Auto-Arena [153, 286], LMExam [10], KIEval [273] Tuning-based (§4.1.2) Score-based TuningChen et al. [28], AttrScore [276], PHUDGE [47], ECT [229], SELF-J [266], SorryBench [249], TIGERScore [99],FENCE [252], ARES [188] Preference-basedLearning Meta-Rewarding [245], Con-J [270], JudgeLM [301], INSTRUCTSCORE [258], AUTO-J [130], Shepherd [232],X-EVAL [142], Themis [88], CritiqueLLM [106], FedEval-LLM [84], PandaLM [236], Self-Taught","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Reinforcement learning with verifiable rewards (RLVR) sidesteps learned reward models entirely by using deterministic verifiers, as demonstrated by approaches like SHARP [29] and SWiRL [26], which synthesize high-quality reasoning trajectories through verifiable rewards and step-wise optimization. Orthogonal to the reward source is the choice of RL training algorithm. DeepSeekMath [46] introduced Group Relative Policy Optimization (GRPO), which eliminates the need for a critic model by leveragin","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"the resulting driving term b tends to concentrate on parameter directions that most strongly affect those critical token predictions. This is consistent with the low-rank structure of ∆W observed in Section 3. Module-Wise Suppression (Functional Redundancy Avoidance).Decompose the parameters intoMmodules (e.g., embedding, attention, MLP layers). Write: ∆θ= (∆θ 1,∆θ 2, . . . ,∆θM), J c = [Jc,1, Jc,2, . . . , Jc,M],(52) whereJ c,m =∂z θ(c)/∂θm|θ0. Then the driving term for modulemis: bm =E c[J ⊤ c","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Let's Verify Step by Step because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (2 contexts).","role_counts":[{"n":2,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-15T03:17:38.602761+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Let's Verify Step by Step","claims":[{"claim_text":"In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, bu","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Let's Verify Step by Step because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:39:43.267266+00:00"}},"summary":{"title":"Let's Verify Step by Step","claims":[{"claim_text":"In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, bu","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Let's Verify Step by Step because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":37},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":36},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":25},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":25},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":18},{"title":"Solving math word problems with process- and outcome-based feedback","work_id":"94492239-b1d5-435e-bea5-7f51992d0614","shared_citers":16},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":16},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":13},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":11},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":10},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":9},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":9},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":9},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":9},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":9},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":9},{"title":"arXiv preprint arXiv:2312.08935 , year=","work_id":"fb547990-d48e-4ba3-a047-af1e506e8290","shared_citers":8},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":8},{"title":"Nature645(8081), 633–638 (2025) https://doi.org/10.1038/s41586-025-09422-z","work_id":"9835b482-5032-4135-93dd-82a066677569","shared_citers":8},{"title":"s1: Simple test-time scaling","work_id":"806265b1-8f22-48dd-b8ad-a99823b18fa4","shared_citers":8},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":8}],"time_series":[{"n":1,"year":2023},{"n":2,"year":2024},{"n":5,"year":2025},{"n":80,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"d489d16b-253f-46a4-97a0-a0ded8d015e8","orcid":null,"display_name":"Bowen Baker","source":"manual","import_confidence":0.72},{"id":"d5c86cf6-d79b-4cf5-b628-e73978a352ec","orcid":null,"display_name":"Harri Edwards","source":"manual","import_confidence":0.72},{"id":"ba8429f2-0064-4e50-b6e5-ff02dd881d98","orcid":null,"display_name":"Hunter Lightman","source":"manual","import_confidence":0.72},{"id":"c6cfb8c7-bdfc-4d22-88ad-4c1a22a4b96d","orcid":null,"display_name":"Teddy Lee","source":"manual","import_confidence":0.72},{"id":"edf3b705-ff6d-4713-9b26-27729234c00d","orcid":null,"display_name":"Vineet Kosaraju","source":"manual","import_confidence":0.72},{"id":"ffac279e-a012-45b9-875d-270eb3f98c86","orcid":null,"display_name":"Yura Burda","source":"manual","import_confidence":0.72}]}}