{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:OQMAWUXNEGORB2TU7H6MP4SDXQ","short_pith_number":"pith:OQMAWUXN","schema_version":"1.0","canonical_sha256":"74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030","source":{"kind":"arxiv","id":"2409.12917","version":2},"attestation_state":"computed","paper":{"title":"Training Language Models to Self-Correct via Reinforcement Learning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Aleksandra Faust, Aviral Kumar, Avi Singh, Colton Bishop, Cosmin Paduraru, Disha Shrivastava, Doina Precup, Feryal Behbahani, George Tucker, John D Co-Reyes, Kate Baumli, Kay McKinney, Lei M Zhang, Rebecca Roelofs, Rishabh Agarwal, Shariq Iqbal, Vincent Zhuang, Yi Su","submitted_at":"2024-09-19T17:16:21Z","abstract_excerpt":"Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline mo"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2409.12917","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2024-09-19T17:16:21Z","cross_cats_sorted":[],"title_canon_sha256":"8ae34f7e383954bfbcd3c1e6be66a80a8a987cc48e21d4f15ed4522b6724ad23","abstract_canon_sha256":"25112ca15391c914fbcb0c3c54f8027aed431f584ff13d6753715ac632734758"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.175648Z","signature_b64":"HY6xwhUoSfVrJGUtEBiRLXsQ2u6V9iibnSxOZZYhqVCsF3ZjQX9T15+LLcJZCR1lPloM0SUtuaVE8GZtd9dIBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030","last_reissued_at":"2026-05-17T23:38:14.174994Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.174994Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Training Language Models to Self-Correct via Reinforcement Learning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Aleksandra Faust, Aviral Kumar, Avi Singh, Colton Bishop, Cosmin Paduraru, Disha Shrivastava, Doina Precup, Feryal Behbahani, George Tucker, John D Co-Reyes, Kate Baumli, Kay McKinney, Lei M Zhang, Rebecca Roelofs, Rishabh Agarwal, Shariq Iqbal, Vincent Zhuang, Yi Su","submitted_at":"2024-09-19T17:16:21Z","abstract_excerpt":"Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline mo"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That training under the model's own distribution of self-generated correction traces combined with the described regularization will produce effective self-correction behavior at test time rather than fitting to high-reward but non-generalizable patterns.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6bb4a455e5bcb1ada76737ed42e60088da79ae7c9f10b067e62bf69e42870512"},"source":{"id":"2409.12917","kind":"arxiv","version":2},"verdict":{"id":"6b7aadc9-0ebb-4484-9e0b-4eb2cede54d3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T11:57:35.333634Z","strongest_claim":"With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.","one_line_summary":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That training under the model's own distribution of self-generated correction traces combined with the described regularization will produce effective self-correction behavior at test time rather than fitting to high-reward but non-generalizable patterns.","pith_extraction_headline":"Multi-turn reinforcement learning trains language models to self-correct using only their own generated data."},"references":{"count":282,"sample":[{"doi":"","year":2024,"title":"Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs","work_id":"7bb8f9ec-1241-4472-a4fa-c636c6d79892","ref_index":1,"cited_arxiv_id":"2402.14740","is_internal_anchor":true},{"doi":"","year":2023,"title":"arXiv preprint arXiv:2305.08844 , year=","work_id":"f3a3dfae-1c31-4afc-b8c6-8af914a8fd24","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","ref_index":3,"cited_arxiv_id":"2108.07732","is_internal_anchor":true},{"doi":"","year":2023,"title":"Teaching Large Language Models to Self-Debug","work_id":"cdfb2680-220c-44eb-9edd-867b75fb821d","ref_index":5,"cited_arxiv_id":"2304.05128","is_internal_anchor":true},{"doi":"","year":2024,"title":"Teaching large language models to reason with reinforcement learning","work_id":"ab9d8347-574c-4fbb-b8e1-44eeba9c66b9","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":282,"snapshot_sha256":"aaeaa7885a3140d87b4d3e837841c5196862891dc1a5de333c8c325246b77a93","internal_anchors":63},"formal_canon":{"evidence_count":3,"snapshot_sha256":"4820a1b896368cf9a9f1d54d47c773f8e90d15fcf7190dcbf03c07c18cf3a22c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2409.12917","created_at":"2026-05-17T23:38:14.175112+00:00"},{"alias_kind":"arxiv_version","alias_value":"2409.12917v2","created_at":"2026-05-17T23:38:14.175112+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2409.12917","created_at":"2026-05-17T23:38:14.175112+00:00"},{"alias_kind":"pith_short_12","alias_value":"OQMAWUXNEGOR","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"OQMAWUXNEGORB2TU","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"OQMAWUXN","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":17,"internal_anchor_count":17,"sample":[{"citing_arxiv_id":"2511.13026","citing_title":"REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2503.06520","citing_title":"Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2510.02283","citing_title":"Self-Forcing++: Towards Minute-Scale High-Quality Video Generation","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2501.09686","citing_title":"Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2412.18925","citing_title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14539","citing_title":"Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21046","citing_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","ref_index":174,"is_internal_anchor":true},{"citing_arxiv_id":"2412.21187","citing_title":"Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs","ref_index":277,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26644","citing_title":"When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09121","citing_title":"A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08936","citing_title":"Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25182","citing_title":"CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21268","citing_title":"Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07248","citing_title":"PaT: Planning-after-Trial for Efficient Test-Time Code Generation","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05226","citing_title":"Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01567","citing_title":"Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04291","citing_title":"Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion","ref_index":56,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ","json":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ.json","graph_json":"https://pith.science/api/pith-number/OQMAWUXNEGORB2TU7H6MP4SDXQ/graph.json","events_json":"https://pith.science/api/pith-number/OQMAWUXNEGORB2TU7H6MP4SDXQ/events.json","paper":"https://pith.science/paper/OQMAWUXN"},"agent_actions":{"view_html":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ","download_json":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ.json","view_paper":"https://pith.science/paper/OQMAWUXN","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2409.12917&json=true","fetch_graph":"https://pith.science/api/pith-number/OQMAWUXNEGORB2TU7H6MP4SDXQ/graph.json","fetch_events":"https://pith.science/api/pith-number/OQMAWUXNEGORB2TU7H6MP4SDXQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ/action/storage_attestation","attest_author":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ/action/author_attestation","sign_citation":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ/action/citation_signature","submit_replication":"https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ/action/replication_record"}},"created_at":"2026-05-17T23:38:14.175112+00:00","updated_at":"2026-05-17T23:38:14.175112+00:00"}