{"paper":{"title":"Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Verifiable process supervision lets language models keep sound reasoning while achieving accurate answers, unlike accuracy-only reinforcement learning which trades reasoning quality for performance.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Chen Wei, Jinwoo Shin, Kevin Wang, Kyuyoung Kim, Peiyang Xu, Peiyao Sheng, Pramod Viswanath, Sewoong Oh, Yunfei Xie, Zhangyang Wang","submitted_at":"2026-04-03T15:19:46Z","abstract_excerpt":"Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntact"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That syntactic extraction of intermediate claims from the structured reasoning format will reliably produce evaluable steps that can be verified against ground-truth signals without introducing extraction errors or missing context.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Verifiable process supervision trains language models to produce accurate answers with sound, verifiable reasoning steps, outperforming accuracy-only reinforcement learning on chess by preserving accuracy while reducing reasoning errors.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Verifiable process supervision lets language models keep sound reasoning while achieving accurate answers, unlike accuracy-only reinforcement learning which trades reasoning quality for performance.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"717eff95a0b965ffa8e92519e912b585fe74f603d6d27208b44783af388689e0"},"source":{"id":"2605.12519","kind":"arxiv","version":1},"verdict":{"id":"cc1c39bb-2182-42c7-830c-a7cbe2265116","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T21:24:50.295762Z","strongest_claim":"While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation.","one_line_summary":"Verifiable process supervision trains language models to produce accurate answers with sound, verifiable reasoning steps, outperforming accuracy-only reinforcement learning on chess by preserving accuracy while reducing reasoning errors.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That syntactic extraction of intermediate claims from the structured reasoning format will reliably produce evaluable steps that can be verified against ground-truth signals without introducing extraction errors or missing context.","pith_extraction_headline":"Verifiable process supervision lets language models keep sound reasoning while achieving accurate answers, unlike accuracy-only reinforcement learning which trades reasoning quality for performance."},"references":{"count":17,"sample":[{"doi":"","year":null,"title":"**Candidate selection**: whether the moves analyzed are reasonable to consider, using the engine summary as a reference for which candidates are meaningful","work_id":"dffc4231-7540-4621-bb35-fe0601dd28a2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"this is a check","work_id":"8c316e25-a7f4-4f55-93ea-dc6362875445","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Use the engine summary to identify the top candidates","work_id":"d5209375-f574-4140-8832-94cfb6f207d4","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Identify which moves the trace analyzes and whether the top move is present","work_id":"c33feeac-f75b-4d5d-9024-c7a71e3ea423","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"For each candidate, assess whether the justification is position-specific or merely generic","work_id":"6732ce55-a0f2-4cbe-b653-0a8d7e1c138d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":17,"snapshot_sha256":"2381c7484433c9ed48e9c0665845041d3d4d167038e8b20585322db6cbaad162","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}