{"paper":{"title":"Training Language Models to Self-Correct via Reinforcement Learning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Aleksandra Faust, Aviral Kumar, Avi Singh, Colton Bishop, Cosmin Paduraru, Disha Shrivastava, Doina Precup, Feryal Behbahani, George Tucker, John D Co-Reyes, Kate Baumli, Kay McKinney, Lei M Zhang, Rebecca Roelofs, Rishabh Agarwal, Shariq Iqbal, Vincent Zhuang, Yi Su","submitted_at":"2024-09-19T17:16:21Z","abstract_excerpt":"Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline mo"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That training under the model's own distribution of self-generated correction traces combined with the described regularization will produce effective self-correction behavior at test time rather than fitting to high-reward but non-generalizable patterns.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6bb4a455e5bcb1ada76737ed42e60088da79ae7c9f10b067e62bf69e42870512"},"source":{"id":"2409.12917","kind":"arxiv","version":2},"verdict":{"id":"6b7aadc9-0ebb-4484-9e0b-4eb2cede54d3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T11:57:35.333634Z","strongest_claim":"With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.","one_line_summary":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That training under the model's own distribution of self-generated correction traces combined with the described regularization will produce effective self-correction behavior at test time rather than fitting to high-reward but non-generalizable patterns.","pith_extraction_headline":"Multi-turn reinforcement learning trains language models to self-correct using only their own generated data."},"references":{"count":282,"sample":[{"doi":"","year":2024,"title":"Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs","work_id":"7bb8f9ec-1241-4472-a4fa-c636c6d79892","ref_index":1,"cited_arxiv_id":"2402.14740","is_internal_anchor":true},{"doi":"","year":2023,"title":"arXiv preprint arXiv:2305.08844 , year=","work_id":"f3a3dfae-1c31-4afc-b8c6-8af914a8fd24","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","ref_index":3,"cited_arxiv_id":"2108.07732","is_internal_anchor":true},{"doi":"","year":2023,"title":"Teaching Large Language Models to Self-Debug","work_id":"cdfb2680-220c-44eb-9edd-867b75fb821d","ref_index":5,"cited_arxiv_id":"2304.05128","is_internal_anchor":true},{"doi":"","year":2024,"title":"Teaching large language models to reason with reinforcement learning","work_id":"ab9d8347-574c-4fbb-b8e1-44eeba9c66b9","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":282,"snapshot_sha256":"aaeaa7885a3140d87b4d3e837841c5196862891dc1a5de333c8c325246b77a93","internal_anchors":63},"formal_canon":{"evidence_count":3,"snapshot_sha256":"4820a1b896368cf9a9f1d54d47c773f8e90d15fcf7190dcbf03c07c18cf3a22c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}